The goal of the Machine Learning and Traveling Repairman Problem (ML&TRP) is to 
determine a route for a "repair crew," which repairs nodes on a graph. The repair crew aims 
to minimize the cost of failures at the nodes, but as in many real situations, the failure 
probabilities are not known and must be estimated. We introduce two formulations for 
the ML&TRP, where the first formulation is sequential: failure probabilities are estimated 
at each node, and then a weighted version of the traveling repairman problem is used to 
construct the route from the failure cost. We develop two models for the failure cost, based 
on whether repeat failures are considered, or only the first failure on a node. Our second 
formulation is a multi-objective learning problem for ranking on graphs. Here, we are 
estimating failure probabilities simultaneously with determining the graph traversal route; 
the choice of route influences the estimated failure probabilities. This is in accordance with 
a prior belief that probabilities that cannot be well-estimated will generally be low. It also 
agrees with a managerial goal of finding a scenario where the data can plausibly support 
choosing a route that has a low operational cost. 

1. Introduction 

We consider the problem of routing an agent ("repair crew") on a graph, where each node 
in the graph has some probability of failure. The probabilities are not known and must 
be estimated from past failure data. Ideally, the nodes that are most prone to failure 
should be repaired first, but if those nodes are far away from each other, the extra time 
spent traveling between nodes might actually increase the chance of failures occurring at 
nodes that have not yet been repaired. In that sense, it is better to construct the route 
to minimize the cost of the failures, taking into account the travel time between nodes 
and also the (estimated) failure probabilities at each of the nodes. We call this problem the 
machine learning and traveling repairman problem (ML&TRP), and in this work, we present 
systematic approaches to formulating and solving this problem. There are many possible 
applications of the ML&TRP, including the scheduling of safety inspections or repair work 
for the electrical grid, oil rigs, underground mining, machines in a factory, or airplanes. 
Another example is to route delivery trucks that carry items that may be damaged if the 
items are in the vehicle too long (e.g., ice cream or other groceries). 

We present two formulations for the ML&TRP. The first formulation is sequential: the 
failure probabilities are estimated, and then the probabilities determine a graph traversal 
cost for each possible route. The route that minimizes the graph traversal cost is then de- 
termined by solving a weighted traveling repairman problem (TRP), also called a minimum 
latency problem, or more generally, a time-dependent traveling salesman problem (see for 
instance, Picard and Queyranne, 1978). The second formulation computes the probabilities 
and the route simultaneously, by minimizing an objective with two terms. The first term 
is a training error term used for estimating probabilities and the second term is the graph 
traversal cost. This means that estimated failure probabilities are chosen together with 
knowledge of the graph traversal cost. The graph traversal cost acts as a regularization 
term, and has a tendency to lower probability estimates and promote generalization; this 
is in accordance with a prior belief that the probabilities will be low if they cannot be 
well-estimated. The algorithm will thus prefer routes where the first nodes visited actu- 
ally have higher failure probabilities. Another reason to incorporate a prior belief that the 
graph traversal cost will be low is managerial. A company might wish to know whether 
it is at all possible that a low-cost route can be designed, where the operational costs are 
realistically supported by the data. Among all reasonable probability models, our second 
formulation will find one that corresponds to the lowest-cost route. This type of formulation 
is optimistic; it provides the best possible (but still reasonable) scenario described by the 
data. 

We design the graph traversal cost in two ways, where either traversal cost can be used 
for either the sequential or the simultaneous formulations. The first graph traversal cost 
is the expected cost of failures, where multiple expected failures at the same node each 
contribute to the cost. The second graph traversal cost is inspired by (but not equal to) 
the expected cost of the first failure at each node. The first cost applies when the failure 
probability of a node does not change until it is visited by the crew, regardless of whether 
a failure already occurred at that node, and the second cost applies when the node is 
completely repaired after the first failure, or when it is visited by the repair crew, whichever 
comes first. Thus, the choice of cost function should depend on the types of failures and 
repairs considered in the application. 

In the second formulation, regularizing by the graph traversal cost limits the complexity 
of the hypothesis space used for the probabilistic model. This means that the graph traversal 
cost term may assist with generalization, that is, the probabilistic model's ability to predict 
well on data drawn from the same distribution. In this work we present a generalization 
bound showing how a limitation on the graph traversal cost might lead to a more accurate 
model for failure probabilities. 

The ML&TRP does not fall under the umbrella of semi-supervised learning, since the 
unlabeled data are used for determining the route cost and are not used to provide 
additional distributional information (see for instance, Chapelle et al., 2006). Also, our 
approach to regularization along the route is entirely different from work on graph 
regularization (Agarwal, 2006; Belkin et al., 2006; Zhou et al., 2004), which does not 
concern traversal routes. In that work, there is an assumption that probabilities will 
be smooth along the graph. This assumption is not true for many applications; see for 
instance, the power grid application discussed below. 

We will discuss a motivating example for the ML&TRP in the remainder of the 
introduction. In Section 2 we will outline the two general formulations, and provide the two 
methods for computing the route cost. In Section 3 we provide mixed-integer nonlinear 
programs (MINLPs) for solving the ML&TRP. Section 4 gives relevant illustrations, along 
with some experiments on data from the NYC power grid. Section 5 contains the theoretical 
generalization result, and in Section 6 we discuss related literature. 

Motivation 

One particularly motivating application for the ML&TRP is smart grid maintenance. Since 
2004, many power utility companies have been implementing new inspection and repair 
programs for preemptive maintenance, whereas in the past, all repair work was done 
reactively (Urbina, 2004). An example of this is vented manhole cover replacement programs, 
where each manhole cover in a city is replaced with a vented cover that allows gases to escape, 
mitigating the possibility and effects of serious events including fires and explosions. 

New York City's power company Con Edison, which has such a replacement program, 
services tens of thousands of manholes in each borough, and it is not sensible for a repair 
crew to travel across the city and back again for each cover replacement. The scheduling 
of manhole inspection and repair work in Manhattan, Brooklyn and the Bronx is assisted 
by a machine learning model that estimates the probability of failure for each manhole 
within a given year (Rudin et al., 2010). Features for the model are derived from physical 
characteristics of the manhole (e.g., number of cables entering the manhole) and from its 
history of involvement in past events. Repeat failures (serious and non-serious events) 
can occur on the same manhole. That said, failures are rare events, and it is 
not easy to accurately estimate the probability that a given manhole will fail within a given 
period of time. The current model for estimating failures does not take into account the 
route of the repair crew that replaces the covers. This leaves open the possibility that, for 
this domain and for many other domains, estimating the failure probabilities with knowledge 
of the route optimization procedure could lead to an improvement in repair operations. 

The features for the NYC machine learning models are recomputed periodically, but not 
often, due to the expense of processing the raw data, and the fact that these probabilities 
change very slowly with time. Because of this, the route must be determined before the 
work starts. Also, the probabilities are not smooth along the graph, as discussed by Rudin 
et al. (2010). In initial attempts to model these probabilities on the Manhattan power grid, 
estimates were smooth geographically, and this type of model did not perform nearly as well 
as a more targeted model. In Manhattan it is very common to have relatively vulnerable 
manholes right next to manholes that are not vulnerable. 

The limited resources for inspection and repair of manholes should generally be 
designated to the most vulnerable manholes. With uncertainty in many of the probability 
estimates, if we are not careful, it is possible that most of these resources will be spent 
in dealing with outliers whose probabilities are overestimated. Our second ML&TRP 
formulation aims to prevent this from happening. At the same time, that formulation may 
be able to find a solution that is supported by the data, but also meets operational cost 
requirements or targets. 

2. ML&TRP Formulations 

We first provide some notation. We are given two sets of instances, $\{x_i\}_{i=1}^{m}$ and 
$\{\tilde{x}_i\}_{i=1}^{M}$, with $x_i \in \mathcal{X}$, $\tilde{x}_i \in \mathcal{X}$ that are feature vectors with 
$\mathcal{X} \subset \mathbb{R}^d$. Let $x_i^j$ indicate the $j$-th coordinate of the feature vector $x_i$. For the 
first set of instances, we are also given labels $\{y_i\}_{i=1}^{m}$, $y_i \in \{-1, 1\}$. These instances 
and their labels are the set of training examples. 

The other instances $\{\tilde{x}_i\}_{i=1}^{M}$ are unlabeled data that are each associated with a node on a 
graph $G$, where the distance between nodes $i$ and $j$, $d_{i,j} \in \mathbb{R}_+$, serves as the weight on the 
edge connecting them. The graph defined by them is complete. A route on $G$ is represented 
by a permutation $\pi$ of the node indices $1, \ldots, M$. Let $\Pi$ be the set of all permutations of 
$\{1, \ldots, M\}$. A set of failure probabilities will be estimated at the nodes, and these estimates 
will be derived from a function of the form $f_\lambda$, where $f_\lambda : \mathcal{X} \to \mathbb{R}$, $f_\lambda \in \mathcal{F}$. The class of 
possible function models $\mathcal{F}$ is chosen to be the set of linear combinations of the feature 
coordinates: 

$$\mathcal{F} := \{ f : f(x) = \lambda \cdot x \text{ for some } \lambda \in \mathbb{R}^d \text{ such that } \|\lambda\|_2 \le M_1 \}, \tag{1}$$

where $M_1$ is a fixed positive real number. We can easily make the function class more 
expressive by using affine functions (such as $g(x) = \lambda \cdot x + d$). But since $g(x)$, like $f(x)$, 
can be written as an inner product of a new parameter $[\lambda \;\; d]$ and a new feature $[x \;\; 1]$, we 
can extend our feature vectors to form new vectors where the last coordinate of $x$ is always 
1 and use function class $\mathcal{F}$ as described. 

For the smart grid maintenance example, the training instances might be manholes 
represented by features derived from data prior to 2010, and the labels might encode whether 
the manhole experienced a serious event within 2010. The test instances might represent 
features derived from data prior to 2011, and the goal is to create a repair route that 
minimizes the cost of repairs in 2011. 

The sequential formulation for the ML&TRP follows two steps. The TrainingError and 
GraphTraversalCost objectives will be defined shortly. 

Sequential Formulation 

Step 1. Compute the scores $f^*_\lambda(x_i)$: 

$$f^*_\lambda \in \operatorname{argmin}_{f_\lambda \in \mathcal{F}} \text{TrainingError}(f_\lambda, \{x_i, y_i\}_{i=1}^{m}).$$

Step 2. Compute a route corresponding to the scores: 

$$\pi^* \in \operatorname{argmin}_{\pi \in \Pi} \text{GraphTraversalCost}(\pi, f^*_\lambda, \{\tilde{x}_i\}_{i=1}^{M}, \{d_{i,j}\}_{i,j=1}^{M}).$$

The result $\pi^* \in \Pi$ is the route used for the repair crew. In the first step, a transformation 
of $f^*_\lambda(x)$ yields an estimate of the probability of failure $P(y = 1 | x)$. Here, $\{y = 1\}$ is the event 
that a failure occurs on any given day or time step. Let the distances be scaled appropriately 
so that a unit of distance is traversed in a unit of time. We assume that the probability of 
failure is the same at each time step until something happens (either the crew visits, or a 
failure occurs), and that $x$ does not change over the time that the route is being traversed. 
To ensure that these probabilities are in agreement with past observations, we choose $f^*_\lambda(x)$ 
to minimize a training error in Step 1. In the second step, the route is chosen to minimize 
a weighted TRP cost based on those estimated probabilities. 

The sequential formulation is easier to solve than the simultaneous formulation outlined 
below. 

Simultaneous Formulation 

Step 1. Compute the scores $f^*_\lambda(x_i)$: 

$$f^*_\lambda \in \operatorname{argmin}_{f_\lambda \in \mathcal{F}} \Big[ \text{TrainingError}(f_\lambda, \{x_i, y_i\}_{i=1}^{m}) + C_1 \min_{\pi \in \Pi} \text{GraphTraversalCost}(\pi, f_\lambda, \{\tilde{x}_i\}_{i=1}^{M}, \{d_{i,j}\}_{i,j=1}^{M}) \Big].$$

Step 2. Compute a route corresponding to the scores: 

$$\pi^* \in \operatorname{argmin}_{\pi \in \Pi} \text{GraphTraversalCost}(\pi, f^*_\lambda, \{\tilde{x}_i\}_{i=1}^{M}, \{d_{i,j}\}_{i,j=1}^{M}).$$

The result $\pi^* \in \Pi$ is the route used for the repair crew. In the simultaneous formulation, 
the model must not have a high graph traversal cost, and must yield probability estimates 
that agree with past observations. The constant $C_1$ is a tradeoff parameter between the 
accuracy of the predictive model and the cost to traverse the graph. $C_1$ needs to be scaled 
relative to the training error, $m$, $M$, and distances $d_{i,j}$. If $C_1$ is very small or if the first 
term is very large compared to the second, the algorithm essentially becomes sequential: 
first the training error is minimized without knowledge of the possible repair routes, and 
the graph traversal route would be the same as if it were determined after the model $f^*_\lambda$. 
In some cases, this may be appropriate; for instance, if the number of training examples is 
extremely large, then there may be little flexibility in the choice of model. 

In what follows, we define the TrainingError and GraphTraversalCost objectives. 

2.1 Training Error Term 

The training error term includes a sum of losses over the training examples: 

$$\sum_{i=1}^{m} l(f_\lambda(x_i), y_i),$$

where the loss function $l$ can be any monotonically decreasing differentiable function 
bounded below by zero. We choose the logistic loss: $l(f_\lambda(x), y) := \ln(1 + e^{-y f_\lambda(x)})$, so that the 
probability of failure $P(y = 1 | x)$ is estimated as in logistic regression by: 

$$P(y = 1 | x) = p(x) := \frac{1}{1 + e^{-f_\lambda(x)}}. \tag{2}$$

We are thus assuming that the log-odds ratio of the class posterior values $P(y = \cdot \,| x)$ can be 
represented by an affine/linear function of the features $x$. The training error corresponds 
to the negative log likelihood: 

$$-\text{log likelihood} = \sum_{i=1}^{m} -\ln\Big( p(x_i)^{(1+y_i)/2} (1 - p(x_i))^{(1-y_i)/2} \Big) = \sum_{i=1}^{m} \ln\big(1 + e^{-y_i f_\lambda(x_i)}\big).$$

We include an $\ell_2$ penalty over the parameters $\lambda$. The regularized training loss is now: 

$$\text{TrainingError}(f_\lambda, \{x_i, y_i\}_{i=1}^{m}) := \sum_{i=1}^{m} \ln\big(1 + e^{-y_i f_\lambda(x_i)}\big) + C_2 \|\lambda\|_2^2, \tag{3}$$

where $C_2$ is the corresponding regularization coefficient. Another possible loss function is 
the exponential loss $e^{-y_i f_\lambda(x_i)}$ used in boosting (Schapire and Freund, 2011), which also 
corresponds to a probability model, though we will not use it here. 

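As an illustration of (2) and (3), the following is a minimal sketch in plain Python, assuming labels in $\{-1, +1\}$ and a linear score $\lambda \cdot x$. The function names and the toy data are hypothetical and are not taken from the original work.

```python
import math

def dot(a, b):
    """Inner product of two equal-length sequences."""
    return sum(ai * bi for ai, bi in zip(a, b))

def failure_probability(lam, x):
    """p(x) = 1 / (1 + exp(-lam . x)), the logistic model in (2)."""
    return 1.0 / (1.0 + math.exp(-dot(lam, x)))

def training_error(lam, xs, ys, c2):
    """Logistic loss summed over examples plus c2 * ||lam||_2^2, as in (3).

    Labels in ys are assumed to take values in {-1, +1}.
    """
    loss = sum(math.log1p(math.exp(-y * dot(lam, x))) for x, y in zip(xs, ys))
    return loss + c2 * dot(lam, lam)

# Made-up toy data; the last coordinate is the constant 1 (bias feature).
xs = [[1.0, 1.0], [-1.0, 1.0]]
ys = [1, -1]
print(training_error([0.0, 0.0], xs, ys, c2=0.5))  # equals 2 * ln 2 at lam = 0
print(failure_probability([0.0, 0.0], xs[0]))      # equals 0.5 at lam = 0
```

At $\lambda = 0$ every example contributes $\ln 2$ and the penalty vanishes, which gives a quick sanity check on any implementation of (3).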
2.2 Two Options for the Graph Traversal Cost 

The graph traversal cost can be defined to match the application. We present two options. 
In Cost 1, for each node there is a cost for failure event {y = 1} in every time step prior to 
a visit by the repair crew. There can be repeated failures, where each failure has the same 
cost. In Cost 2, there is a cost for the first event on a node prior to its visit by the repair 
crew. Cost 2 assumes that the repair crew wants to reach the node before its first failure, 
and that there are no additional failures after the first one. Both Cost 1 and Cost 2 can 
be appropriate for various power grid applications. Cost 2 is appropriate for the delivery 
truck application, where perishable items can fail (once an item has spoiled, it cannot spoil 
again). Both graph traversal costs use the predictions on the nodes, $f_\lambda(\tilde{x}_i)$. 

We can provide a very natural interpretation in terms of a simple stochastic process for 
both these costs. Consider that there is a continuous time stochastic process at each node 
$\tilde{x}_i$, which when discretized by time steps of appropriate duration, can be approximated 
by a Bernoulli process with parameter $p(\tilde{x}_i)$. We will see later on that for Cost 2, the 
random variables at each time step are required to be independent whereas for Cost 1 no 
such restriction is necessary (because of linearity of expectations). Such a stochastic process 
perspective is useful when one designs their own, possibly more complex, cost models. 

We assume for convenience and without loss of generality that after the repair crew visits 
all the nodes, it returns to the location of the starting node, which is fixed beforehand to be 
node 1 ($\pi(1) = 1$). This assumption simplifies the exposition of the paper, but in doing so 
excludes the scenarios where one might not be interested in returning to the starting node 
or when one wants to start from a separate depot. Either of these cases would require a 
slight change in the cost model, as discussed for instance by Ezzine et al. (2010), and would 
not change the computational complexity of finding a solution. Another setting which one 
might be interested in is that of finding an optimal tour without specifying a starting 
point. In this case, the problem formulation with a fixed starting node can be readily used. 
In particular, we can obtain a solution by choosing each node in turn to be the starting 
node, solving the fixed starting node formulation $M$ times, and picking the best of the 
$M$ solutions. Because such extensions can be performed relatively easily, we are assuming 
a fixed known starting node in our formulations. 

Let a route be represented by $\pi : \{1, \ldots, M\} \mapsto \{1, \ldots, M\}$; this means that $\pi(i)$ is the 
$i^{\text{th}}$ node to be visited. For example, let $M = 4$, $\pi = [2, 3, 4, 1]$. This means $\pi(1) = 2$, node 
2 is the first node to be visited, $\pi(2) = 3$, node 3 is the second node on the route, and so 
on. The latency of a node $\pi(i)$ with respect to route $\pi$ is the time (or equivalently distance) 
at which node $\pi(i)$ is visited. It is the sum of distances traversed before position $i$ on the 
route: 

$$L_\pi(\pi(i)) := \text{time at which node } \pi(i) \text{ is visited} = \begin{cases} \sum_{k=1}^{M} d_{\pi(k)\pi(k+1)} \mathbf{1}_{[k<i]} & i = 2, \ldots, M \\ \sum_{k=1}^{M} d_{\pi(k)\pi(k+1)} & i = 1. \end{cases} \tag{4}$$

The assumption that the final node is the first node means that $d_{\pi(M)\pi(M+1)} = d_{\pi(M)\pi(1)}$. 
The starting node $\pi(1)$ thus has a latency $L_\pi(\pi(1))$ which is the total length of the route 
starting at node $\pi(1)$ and ending at node $\pi(1)$ after visiting all other nodes. 
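The latency definition (4) can be sketched in a few lines of Python. This is an illustrative sketch only: node indices are 0-based here (unlike the 1-based indices in the text), and the function name and the four-node example on a line are assumptions made for the illustration.

```python
def latencies(route, dist):
    """Latency of every node on a closed tour, following definition (4).

    route: node indices in visiting order (0-based here); route[0] is the start.
    dist:  square matrix with dist[a][b] the travel time from a to b.
    """
    lat = {}
    elapsed = 0.0
    for prev, node in zip(route, route[1:]):
        elapsed += dist[prev][node]
        lat[node] = elapsed
    # The start node is "visited" when the crew returns to it at the end.
    lat[route[0]] = elapsed + dist[route[-1]][route[0]]
    return lat

# Hypothetical example: four nodes on a line at positions 0, 1, 3 and 6.
pos = [0.0, 1.0, 3.0, 6.0]
dist = [[abs(a - b) for b in pos] for a in pos]
print(latencies([0, 1, 2, 3], dist))
```

For the route that sweeps left to right, nodes 1, 2 and 3 have latencies 1, 3 and 6, and the start node has latency 12, the full tour length including the return trip.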

Cost 1: Cost is Proportional to Expected Number of Failures Before the Visit 

Up to the time that node $\pi(i)$ is visited by the repair crew, there is a probability 
$p(\tilde{x}_{\pi(i)})$ that a failure will occur within each unit time interval. Equivalently, within each 
unit time interval, failures are determined by a Bernoulli random variable with parameter 
$p(\tilde{x}_{\pi(i)})$. Thus, in a time interval of length $L_\pi(\pi(i))$ units, the number of node failures follows 
the Binomial distribution $\text{Bin}(L_\pi(\pi(i)), p(\tilde{x}_{\pi(i)}))$. For each node, we will associate a cost 
proportional to the expected number of failures before the repair crew's visit, as follows: 

$$\text{Cost of node } \pi(i) \propto E(\text{number of failures in } L_\pi(\pi(i)) \text{ time units}) = \text{mean of Bin}(L_\pi(\pi(i)), p(\tilde{x}_{\pi(i)})) = p(\tilde{x}_{\pi(i)}) L_\pi(\pi(i)). \tag{5}$$

Using this cost, if the failure probability for node $\pi(i)$, namely $p(\tilde{x}_{\pi(i)})$, is small, we can 
afford to visit it later on during our graph tour when the latency $L_\pi(\pi(i))$ is larger. If 
$p(\tilde{x}_{\pi(i)})$ is large, we should visit node $\pi(i)$ earlier to keep our overall graph traversal cost 
low. 

The total cost of route $\pi$ is: 

$$\text{GraphTraversalCost}(\pi, f_\lambda, \{\tilde{x}_i\}_{i=1}^{M}, \{d_{i,j}\}_{i,j=1}^{M}) = \sum_{i=1}^{M} p(\tilde{x}_{\pi(i)}) L_\pi(\pi(i)).$$

Substituting the definition of $L_\pi(\pi(i))$ from (4): 

$$\text{GraphTraversalCost}(\pi, f_\lambda, \{\tilde{x}_i\}_{i=1}^{M}, \{d_{i,j}\}_{i,j=1}^{M}) = \sum_{i=2}^{M} p(\tilde{x}_{\pi(i)}) \sum_{k=1}^{M} d_{\pi(k)\pi(k+1)} \mathbf{1}_{[k<i]} + p(\tilde{x}_{\pi(1)}) \sum_{k=1}^{M} d_{\pi(k)\pi(k+1)}, \tag{6}$$

where $p(\tilde{x}_{\pi(i)})$ is given in (2). This will be Cost 1. 
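A minimal sketch of Cost 1, the probability-weighted sum of latencies in (6), follows. Node indices are 0-based, and the probabilities and distances are made-up values chosen only to show how the route order changes the cost.

```python
def cost1(route, probs, dist):
    """Cost 1 of a closed tour: sum over nodes of p(node) * latency(node), as in (6).

    route[0] is the start node, whose latency is the full tour length.
    """
    total = 0.0
    elapsed = 0.0
    for prev, node in zip(route, route[1:]):
        elapsed += dist[prev][node]
        total += probs[node] * elapsed
    tour_length = elapsed + dist[route[-1]][route[0]]
    return total + probs[route[0]] * tour_length

# Hypothetical example: four nodes on a line, with made-up failure probabilities.
pos = [0.0, 1.0, 3.0, 6.0]
dist = [[abs(a - b) for b in pos] for a in pos]
probs = [0.1, 0.5, 0.2, 0.05]

near_first = cost1([0, 1, 2, 3], probs, dist)
far_first = cost1([0, 3, 2, 1], probs, dist)
print(near_first, far_first)
```

Both routes have the same total length, but visiting the high-probability node 1 early gives a cost of about 2.6, against about 8.8 when it is visited last.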

There are ways to make Cost 1 more general. The individual node cost in (5) assumes 
that the node failure probability $p(\tilde{x}_{\pi(i)})$ becomes zero after the repair crew's visit, so that 
for the remainder of the route, the cost incurred at this node is $\propto 0 \times (L_\pi(\pi(1)) - L_\pi(\pi(i)))$. 
We could relax this by assuming $p(\tilde{x}_{\pi(i)})$ does not vanish after the repair crew's visit and 
adding an additional cost for the expected failures in this period. That is, if $\beta$ is a constant 
of proportionality for the cost after visiting node $\pi(i)$, then the cost would become: 

$$\text{Cost of node } \pi(i) = \beta \left[ L_\pi(\pi(1)) - L_\pi(\pi(i)) \right] + L_\pi(\pi(i)) \, p(\tilde{x}_{\pi(i)}).$$

If $\beta = 1$, then the repair crew does not have any effect and the cost of each node is independent 
of its expected number of failures before the repair crew's visit. Typically, we expect that 
the repair crew will repair the node so that it will not fail, and the second term above is 
much larger than the first. Taking the constant of proportionality as $\beta = 0$, we return to 
the individual costs given by (5). 

Note that since the cost is a sum of $M$ terms, it is invariant to ordering or indexing 
(caused by $\pi$). Thus we can rewrite the cost as 

$$\text{GraphTraversalCost}(\pi, f_\lambda, \{\tilde{x}_i\}_{i=1}^{M}, \{d_{i,j}\}_{i,j=1}^{M}) = \sum_{i=1}^{M} p(\tilde{x}_i) L_\pi(i),$$

where $p(\tilde{x}_i)$ is given in (2). 



Cost 2: Cost is Proportional to Probability that First Failure is Before the Visit 

This cost reflects the penalty for not visiting a node before the first failure occurs there. 
This model is governed by the geometric distribution: the probability that the first failure 
for node $\pi(i)$ occurs at time $L_\pi(\pi(i))$ is $p(\tilde{x}_{\pi(i)})(1 - p(\tilde{x}_{\pi(i)}))^{L_\pi(\pi(i)) - 1}$, and: 

$$P(\text{first failure occurs before time } L_\pi(\pi(i))) = 1 - (1 - p(\tilde{x}_{\pi(i)}))^{L_\pi(\pi(i))}. \tag{7}$$

The cost of visiting node $\pi(i)$ will be proportional to this quantity. Substituting the 
expression (2) for $p(\tilde{x}_{\pi(i)})$: 

$$\text{Cost of node } \pi(i) \propto 1 - \left( 1 - \frac{1}{1 + e^{-f_\lambda(\tilde{x}_{\pi(i)})}} \right)^{L_\pi(\pi(i))} = 1 - \left( 1 + e^{f_\lambda(\tilde{x}_{\pi(i)})} \right)^{-L_\pi(\pi(i))}. \tag{8}$$

Similarly to Cost 1, $L_\pi(\pi(i))$ influences the cost at each node. If we visit a node early in 
the route, then the cost incurred is small because the node is less likely to fail before we 
reach it. Similarly, if we schedule a visit later on in the tour, the cost is higher because the 
node has a higher chance of failing prior to the repair crew's visit. We generally want to 
visit nodes with higher probability of failures early and schedule the less vulnerable nodes 
later. The total route cost is thus: 

$$\text{GraphTraversalCost}(\pi, f_\lambda, \{\tilde{x}_i\}_{i=1}^{M}, \{d_{i,j}\}_{i,j=1}^{M}) = \sum_{i=1}^{M} \left( 1 - \left( 1 + e^{f_\lambda(\tilde{x}_{\pi(i)})} \right)^{-L_\pi(\pi(i))} \right). \tag{9}$$
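The route cost (9) can be evaluated directly from the scores and latencies. The sketch below is illustrative only: indices are 0-based, the example data are made up, and the power $(1 + e^{f})^{-L}$ is computed as $\exp(-L \log(1 + e^{f}))$, which is an implementation choice rather than something prescribed in the text.

```python
import math

def cost2(route, scores, dist):
    """Sum over nodes of 1 - (1 + exp(f))^(-L), the route cost in (9).

    scores[node] is the model output f for that node; latencies follow (4),
    so the start node route[0] gets the full tour length.
    """
    total = 0.0
    elapsed = 0.0
    for prev, node in zip(route, route[1:]):
        elapsed += dist[prev][node]
        total += 1.0 - math.exp(-elapsed * math.log1p(math.exp(scores[node])))
    tour = elapsed + dist[route[-1]][route[0]]
    total += 1.0 - math.exp(-tour * math.log1p(math.exp(scores[route[0]])))
    return total

# Hypothetical example: four nodes on a line, all with score 0 (p = 0.5 each).
pos = [0.0, 1.0, 3.0, 6.0]
dist = [[abs(a - b) for b in pos] for a in pos]
value = cost2([0, 1, 2, 3], [0.0, 0.0, 0.0, 0.0], dist)
print(value)
```

With all scores equal to zero, each node contributes $1 - 2^{-L}$, so the cost saturates quickly in the latency. This is the nonlinearity in the latencies that the text refers to next.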
This cost is not directly related to a TRP cost in its present form. That is, when the 
failure probabilities of the nodes are all the same, the total cost is not linear in the latencies, 
as is the case for Cost 1. Loosely inspired by the route cost above, we will derive a cost that 
is similar to a weighted version of the TRP in Section 3.2, choosing it to be of the form: 

$$\text{Cost of node } \pi(i) \propto L_\pi(\pi(i)) \log\left( 1 + e^{f_\lambda(\tilde{x}_{\pi(i)})} \right), \tag{10}$$

as an alternative to (8). 

There is a slightly more general version of this formulation (as there was for Cost 1), 
which is to take the cost for each node to be a function of two quantities: the probability 
of failure before the visit, and the probability of failure after the visit. Let us redefine $\beta$ 
to be a constant of proportionality for the cost of visiting before the failure event. From 
the geometric distribution, $P(\text{first failure occurs after time } L_\pi(\pi(i))) = (1 - p(\tilde{x}_{\pi(i)}))^{L_\pi(\pi(i))}$, 
and the cost of visiting node $\pi(i)$ becomes: 

$$\text{Cost of node } \pi(i) \propto P(\text{failure before } L_\pi(\pi(i))) + \beta \, P(\text{failure after } L_\pi(\pi(i))). \tag{11}$$

If $\beta = 1$, then the sum above is 1 for all nodes regardless of node failures or latencies. More 
realistically, the cost of visiting the node after the failure is more than the cost of visiting 
proactively, $\beta \ll 1$, leading to (8). 

We could again have written the summation to hide the dependence on $\pi$: 

$$\text{GraphTraversalCost}(\pi, f_\lambda, \{\tilde{x}_i\}_{i=1}^{M}, \{d_{i,j}\}_{i,j=1}^{M}) = \sum_{i=1}^{M} \left( 1 - \left( 1 + e^{f_\lambda(\tilde{x}_i)} \right)^{-L_\pi(i)} \right).$$

Now that the major steps for both formulations have been defined, we will discuss methods 
for optimizing the objectives. 

3. Optimization 

We start by formulating mixed-integer linear programs (MILPs) for the graph traversal 
cost subproblem. 

3.1 Mixed-integer optimization for Cost 1 

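For either the sequential or simultaneous formulations, we need the solution of the 
subproblem: 

$$\pi^* \in \operatorname{argmin}_{\pi \in \Pi} \text{GraphTraversalCost}(\pi, f^*_\lambda, \{\tilde{x}_i\}_{i=1}^{M}, \{d_{i,j}\}_{i,j=1}^{M}) = \operatorname{argmin}_{\pi \in \Pi} \sum_{i=2}^{M} p(\tilde{x}_{\pi(i)}) \sum_{k=1}^{M} d_{\pi(k)\pi(k+1)} \mathbf{1}_{[k<i]} + p(\tilde{x}_{\pi(1)}) \sum_{k=1}^{M} d_{\pi(k)\pi(k+1)}. \tag{12}$$

Let us compare this to the standard traveling repairman problem (TRP) (see Blum 
et al., 1994): 

$$\pi^* \in \operatorname{argmin}_{\pi \in \Pi} \sum_{k=1}^{M} d_{\pi(k)\pi(k+1)} (M + 1 - k). \tag{13}$$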






TULABANDHULA ET AL. 



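The standard TRP objective (13) is a special case of the weighted TRP (12) when 
$p(\tilde{x}_i) = p$ for all $i = 1, \ldots, M$: 

$$\begin{aligned}
\sum_{i=2}^{M} p(\tilde{x}_{\pi(i)}) \sum_{k=1}^{M} d_{\pi(k)\pi(k+1)} \mathbf{1}_{[k<i]} + p(\tilde{x}_{\pi(1)}) \sum_{k=1}^{M} d_{\pi(k)\pi(k+1)}
&= p \sum_{i=2}^{M} \sum_{k=1}^{M} d_{\pi(k)\pi(k+1)} \mathbf{1}_{[k<i]} + p \sum_{k=1}^{M} d_{\pi(k)\pi(k+1)} \\
&= p \left( \sum_{i=2}^{M} \sum_{k=1}^{M} d_{\pi(k)\pi(k+1)} \mathbf{1}_{[k<i]} + \sum_{k=1}^{M} d_{\pi(k)\pi(k+1)} \mathbf{1}_{[k<M+1]} \right) \\
&= p \sum_{k=1}^{M} d_{\pi(k)\pi(k+1)} \sum_{i=2}^{M+1} \mathbf{1}_{[k<i]} \\
&= p \sum_{k=1}^{M} d_{\pi(k)\pi(k+1)} (M + 1 - k).
\end{aligned}$$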
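The reduction above can be checked numerically on a small instance. The sketch below, with hypothetical helper names and made-up distances, compares the weighted objective (12) under equal probabilities with $p$ times the standard TRP objective (13) for every route that starts at node 0 (0-based indices).

```python
from itertools import permutations

def weighted_cost(route, probs, dist):
    """Weighted TRP objective (12): sum of p(node) * latency(node) on a closed tour."""
    total, elapsed = 0.0, 0.0
    for prev, node in zip(route, route[1:]):
        elapsed += dist[prev][node]
        total += probs[node] * elapsed
    return total + probs[route[0]] * (elapsed + dist[route[-1]][route[0]])

def standard_trp_cost(route, dist):
    """Standard TRP objective (13): the k-th edge (1-based) is counted M + 1 - k times."""
    m = len(route)
    return sum(dist[route[k]][route[(k + 1) % m]] * (m - k) for k in range(m))

# Hypothetical instance: four nodes on a line, equal failure probability p.
pos = [0.0, 1.0, 3.0, 6.0]
dist = [[abs(a - b) for b in pos] for a in pos]
p = 0.3

max_gap = max(
    abs(weighted_cost((0,) + rest, [p] * 4, dist) - p * standard_trp_cost((0,) + rest, dist))
    for rest in permutations(range(1, 4))
)
print(max_gap)  # zero up to floating-point rounding
```

Because the two objectives differ only by the constant factor $p$, they rank all routes identically, which is why the standard TRP is recovered as a special case.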

The TRP is different from the traveling salesman problem (TSP); the goal of the 
traveling salesman problem is to minimize the total traversal time (in this case, this is the same 
as the distance traveled) needed to visit all nodes once, whereas the goal of the traveling 
repairman problem is to minimize the sum of the waiting times to visit each node. Both the 
TSP and the TRP are known to be NP-complete in the general case (Blum et al., 1994). 
Intuitively, a TRP route cost objective captures the total waiting cost of a service system 
from the customer's (the node's) point of view. For example, consider a truck carrying 
prioritized items to be delivered to customers. At each customer's stop, that customer's 
item is removed from the truck. The goal of the TRP is to minimize the total waiting time 
of these customers. 

We need to extend the standard TRP to include "unequal flow values" that will 
accommodate the more general problem (12). We use as a starting point the work of Fischetti 
et al. (1993), who give an integer programming formulation of the standard TRP. (Note that 
there are usually many ways that an integer program can be constructed; see Mendez-Diaz 
et al., 2008.) We will suitably introduce pre-defined weights $\{\bar{p}(\tilde{x}_i)\}_i$ to arrive at (12) as the 
objective. For Cost 1, we will take $\bar{p}(\tilde{x}_i) := p(\tilde{x}_i) = 1 / (1 + \exp(-f_\lambda(\tilde{x}_i)))$, though we leave 
the weights in the formulation more generally as $\bar{p}(\tilde{x}_i)$. In order to interpret the formulation 
below, consider the sum of the probabilities $\sum_{i=1}^{M} \bar{p}(\tilde{x}_i)$ as the total "flow" through a route. 
At the beginning of the tour, the repair crew has flow $\sum_{i=1}^{M} \bar{p}(\tilde{x}_i)$. Along the tour, flow of 
the amount $\bar{p}(\tilde{x}_i)$ is dropped when the repair crew visits node $i$ at latency $L_\pi(i)$. In this 
way, the amount of flow during the tour is the sum of the probabilities $\bar{p}(\tilde{x}_i)$ for nodes that 
the repair crew has not yet visited. 

We deviate from the $\pi$ notation to represent routes in the mixed-integer program. 
Instead, we introduce two sets of variables $\{z_{i,j}\}_{i,j}$ and $\{y_{i,j}\}_{i,j}$ which can together represent 
a route. Let the set of edges of graph $G$ be denoted by $E$. Let $z_{i,j}$ represent the flow on 
edge $(i,j) \in E$ and let a binary variable $y_{i,j}$ represent whether there exists a flow on edge 
$(i,j) \in E$. (There will only be a flow along the route, and there will not be a flow along 
edges that are not in the route.) The mixed-integer program is as follows: 

$$\min_{z, y} \sum_{i=1}^{M} \sum_{j=1}^{M} d_{i,j} z_{i,j} \quad \text{s.t.} \tag{14}$$

No flow from node $i$ to itself: 
$$z_{i,i} = 0 \quad \forall i = 1, \ldots, M \tag{15}$$

No edge from node $i$ to itself: 
$$y_{i,i} = 0 \quad \forall i = 1, \ldots, M \tag{16}$$

Exactly one edge into each node: 
$$\sum_{i=1}^{M} y_{i,j} = 1 \quad \forall j = 1, \ldots, M \tag{17}$$

Exactly one edge out from each node: 
$$\sum_{j=1}^{M} y_{i,j} = 1 \quad \forall i = 1, \ldots, M \tag{18}$$

Flow coming back to the initial point at the end of the loop is $\bar{p}(\tilde{x}_1)$: 
$$\sum_{i=1}^{M} z_{i,1} = \bar{p}(\tilde{x}_1) \tag{19}$$

Change of flow after crossing node $k$ is either $\bar{p}(\tilde{x}_k)$ or it is $\bar{p}(\tilde{x}_1)$ minus the sum of $\bar{p}$'s: 
$$\sum_{i=1}^{M} z_{i,k} - \sum_{j=1}^{M} z_{k,j} = \begin{cases} \bar{p}(\tilde{x}_1) - \sum_{i=1}^{M} \bar{p}(\tilde{x}_i) & k = 1 \\ \bar{p}(\tilde{x}_k) & k = 2, \ldots, M \end{cases} \tag{20}$$

Connects flows $z$ to indicators of edge $y$: 
$$z_{i,j} \le r_{i,j} \, y_{i,j}, \quad \text{where } r_{i,j} = \begin{cases} \bar{p}(\tilde{x}_1) & j = 1 \\ \sum_{i=1}^{M} \bar{p}(\tilde{x}_i) & i = 1 \\ \sum_{i=2}^{M} \bar{p}(\tilde{x}_i) & \text{otherwise.} \end{cases} \tag{21}$$

Constraints (15) and (16) restrict self-loops from forming. Constraints (17) and (18) impose 
that every node should have exactly one edge coming in and one going out. Constraint (19) 
represents the flow on the last edge coming back to the starting node. Constraint (20) 
quantifies the flow change after traversing a node $k$. Constraint (21) represents an upper 
bound on $z_{i,j}$ relating it to the corresponding binary variable $y_{i,j}$. 
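To make the flow interpretation concrete, the sketch below builds the variables $z$ and $y$ induced by a given route and checks them against the constraints. It is an illustration under stated assumptions and not a solver. Indices are 0-based with the start node at index 0. The flow-balance check follows the verbal description of the change of flow after crossing a node. The linking constraint (21) is checked only in the weaker form "no flow off the route", and the specific bounds $r_{i,j}$ are not evaluated. All names and data are hypothetical.

```python
def route_to_flow(route, w):
    """Build flow z and edge indicators y for a closed tour starting at route[0]."""
    m = len(route)
    z = [[0.0] * m for _ in range(m)]
    y = [[0] * m for _ in range(m)]
    flow = sum(w)                      # the crew leaves the start carrying all weight
    for pos in range(m):
        a, b = route[pos], route[(pos + 1) % m]
        y[a][b], z[a][b] = 1, flow
        if pos + 1 < m:
            flow -= w[b]               # weight of node b is dropped on arrival
    return z, y

def satisfies_constraints(z, y, w, start=0, tol=1e-9):
    m, total = len(w), sum(w)
    for k in range(m):
        if z[k][k] != 0 or y[k][k] != 0:                           # (15), (16)
            return False
        if sum(y[i][k] for i in range(m)) != 1 or sum(y[k]) != 1:  # (17), (18)
            return False
        net = sum(z[i][k] for i in range(m)) - sum(z[k])           # inflow minus outflow
        target = w[start] - total if k == start else w[k]          # flow balance, cf. (20)
        if abs(net - target) > tol:
            return False
    if abs(sum(z[i][start] for i in range(m)) - w[start]) > tol:   # (19)
        return False
    # Weak form of (21): no flow on edges that are not part of the route.
    return all(y[i][j] == 1 or z[i][j] == 0 for i in range(m) for j in range(m))

def objective(z, dist):
    """Objective (14): sum of distance times flow over all edges."""
    return sum(dist[i][j] * z[i][j] for i in range(len(z)) for j in range(len(z)))

# Hypothetical instance: four nodes on a line with made-up weights.
pos = [0.0, 1.0, 3.0, 6.0]
dist = [[abs(a - b) for b in pos] for a in pos]
w = [0.1, 0.5, 0.2, 0.05]
z, y = route_to_flow([0, 1, 2, 3], w)
print(satisfies_constraints(z, y, w), objective(z, dist))
```

On this instance the objective (14) evaluated on the induced flow equals the Cost 1 value of the same route (about 2.6), which is the equivalence the formulation relies on.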

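3.2 Mixed-integer optimization for Cost 2 

Here we reason about the choice for changing Cost 2 in (8) to resemble (10). Starting with 
the sum (9) over costs (8), 

$$\min_{\pi} \sum_{i=1}^{M} \left( 1 - \left( 1 + e^{f_\lambda(\tilde{x}_{\pi(i)})} \right)^{-L_\pi(\pi(i))} \right),$$

we apply the log function to the power term in the cost of each node (8) to get a new cost 

$$1 - \log\left( \left( 1 + e^{f_\lambda(\tilde{x}_{\pi(i)})} \right)^{-L_\pi(\pi(i))} \right),$$

and the minimization becomes instead: 

$$\begin{aligned}
\min_{\pi} \sum_{i=1}^{M} \left( 1 - \log\left( \left( 1 + e^{f_\lambda(\tilde{x}_{\pi(i)})} \right)^{-L_\pi(\pi(i))} \right) \right)
&= -\max_{\pi} \left( \sum_{i=1}^{M} \left( -L_\pi(\pi(i)) \log\left( 1 + e^{f_\lambda(\tilde{x}_{\pi(i)})} \right) \right) - M \right) \\
&= \min_{\pi} \sum_{i=1}^{M} L_\pi(\pi(i)) \log\left( 1 + e^{f_\lambda(\tilde{x}_{\pi(i)})} \right) + M,
\end{aligned}$$

which is, up to the constant $M$, the sum over nodes of the expression (10). This graph traversal 
cost term is now a weighted sum of latencies ($+M$) where the weights are of the form 
$\log\left( 1 + e^{f_\lambda(\tilde{x}_{\pi(i)})} \right)$. We can thus reuse the mixed-integer program (14)-(21) where the weights 
are defined as $\bar{p}(\tilde{x}_i) := \log\left( 1 + e^{\lambda \cdot \tilde{x}_i} \right)$. 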
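The identity behind the modified Cost 2 can be verified numerically. In the sketch below (0-based indices, made-up scores and distances, hypothetical function names), the log-transformed cost summed over nodes is compared with the weighted sum of latencies plus the constant $M$.

```python
import math

def route_latencies(route, dist):
    """Latencies per (4); the start node route[0] gets the full tour length."""
    lat, elapsed = {}, 0.0
    for prev, node in zip(route, route[1:]):
        elapsed += dist[prev][node]
        lat[node] = elapsed
    lat[route[0]] = elapsed + dist[route[-1]][route[0]]
    return lat

def log_transformed_cost(route, scores, dist):
    """Sum over nodes of 1 - log((1 + exp(f))^(-L))."""
    lat = route_latencies(route, dist)
    return sum(1.0 - math.log((1.0 + math.exp(scores[n])) ** (-lat[n])) for n in route)

def weighted_latency_cost(route, scores, dist):
    """Sum over nodes of L * log(1 + exp(f)), i.e. the sum of (10) over nodes."""
    lat = route_latencies(route, dist)
    return sum(lat[n] * math.log1p(math.exp(scores[n])) for n in route)

# Hypothetical instance: four nodes on a line with made-up scores.
pos = [0.0, 1.0, 3.0, 6.0]
dist = [[abs(a - b) for b in pos] for a in pos]
scores = [-2.0, 0.5, -1.0, -3.0]
route = [0, 2, 1, 3]

gap = log_transformed_cost(route, scores, dist) - weighted_latency_cost(route, scores, dist)
print(gap)  # equals M = 4 up to floating-point rounding
```

Since the two costs differ by a constant that does not depend on the route, minimizing either one over routes gives the same minimizer, so the weighted TRP machinery applies with weights $\log(1 + e^{f})$.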

The sequential formulation has now been completely defined for both Cost 1 and Cost 2. 



3.3 Solvers for the weighted TRP subproblem 

A generic MILP solver like CPLEX$^1$ or Gurobi$^2$ can produce an exact solution using 
branch-and-bound or other related exact methods. In our experiments we use Gurobi. The weighted 
TRP problem is NP-hard (this can be shown by a reduction from the Hamiltonian cycle problem) 
and hence most likely not solvable by polynomial-time algorithms. The standard 
(unweighted, that is, all weights equal) TRP can be encoded by different mixed-integer programming 
formulations (see Fischetti et al., 1993; van Eijl, 1995; Mendez-Diaz et al., 2008), each with 
different performance guarantees (e.g., solving 15-60 nodes), which could be adapted for 
our purpose. There are also techniques for producing constant factor approximate solutions 
to the unweighted TRP, which could run faster than the MILP solvers for large problem 
instances. If the weights are integers, we can adapt these faster techniques for the standard 
problem to the weighted TRP problem by replicating each node $i$ a total of $W_i$ times, where 
$W_i$ is equal to the weight of node $i$. If the weights are rational (as is the case in (22) and 
(23)), a rounding and discretization trick is one way to map back to the standard solution 
techniques. More on this topic is discussed in Section 6. 
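For very small instances, the weighted TRP subproblem can also be solved by enumerating all routes, which is useful as a reference when testing a MILP model. The sketch below is such a baseline and is not the solver used in the experiments. It assumes 0-based indices with a fixed start node, and the instance data are made up.

```python
from itertools import permutations

def weighted_trp_cost(route, w, dist):
    """Weighted sum of latencies on a closed tour; route[0] gets the full tour length."""
    total, elapsed = 0.0, 0.0
    for prev, node in zip(route, route[1:]):
        elapsed += dist[prev][node]
        total += w[node] * elapsed
    return total + w[route[0]] * (elapsed + dist[route[-1]][route[0]])

def solve_by_enumeration(w, dist, start=0):
    """Exact solution by trying all (M - 1)! routes; only viable for tiny M."""
    others = [i for i in range(len(w)) if i != start]
    best = min(((start,) + rest for rest in permutations(others)),
               key=lambda r: weighted_trp_cost(r, w, dist))
    return list(best), weighted_trp_cost(best, w, dist)

# Hypothetical instance: four nodes on a line with made-up weights.
pos = [0.0, 1.0, 3.0, 6.0]
dist = [[abs(a - b) for b in pos] for a in pos]
w = [0.1, 0.5, 0.2, 0.05]
best_route, best_cost = solve_by_enumeration(w, dist)
print(best_route, best_cost)
```

The factorial growth in the number of routes is what makes branch-and-bound or approximation techniques necessary beyond a handful of nodes.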

3.4 Mixed-integer nonlinear programs (MINLPs) 

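For the simultaneous formulation, the inputs to the program are training data $\{x_i, y_i\}_{i=1}^{m}$, unlabeled nodes $\{\tilde{x}_i\}_{i=1}^{M}$, the distances between them $\{d_{i,j}\}_{i,j=1}^{M}$ and constants $C_1$ and $C_2$. The full objective using Cost 1 is:

$$\min_{\lambda,\,\{z_{i,j},\,y_{i,j}\}} \; \sum_{i=1}^{m} \ln\big(1 + e^{-y_i f_\lambda(x_i)}\big) + C_2 \|\lambda\|_2^2 + C_1 \sum_{i=1}^{M} \sum_{j=1}^{M} d_{i,j} z_{i,j} \quad \text{s.t.}$$

constraints (15) to (21) hold, where $p(\tilde{x}_i) = \dfrac{1}{1 + e^{-\lambda \cdot \tilde{x}_i}}$,

or equivalently,

$$\min_{\lambda} \left( \sum_{i=1}^{m} \ln\big(1 + e^{-y_i f_\lambda(x_i)}\big) + C_2 \|\lambda\|_2^2 + C_1 \min_{\{z_{i,j},\,y_{i,j}\}} \sum_{i=1}^{M} \sum_{j=1}^{M} d_{i,j} z_{i,j} \right) \quad \text{s.t.} \qquad (22)$$

constraints (15) to (21) hold, where $p(\tilde{x}_i) = \dfrac{1}{1 + e^{-\lambda \cdot \tilde{x}_i}}$.

1. IBM ILOG CPLEX Optimization Studio v12.2.0.2, 2010.

2. Gurobi Optimizer v3.0, Gurobi Optimization, Inc., 2010.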



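The full objective using the modified version of Cost 2 is:

$$\min_{\lambda} \left( \sum_{i=1}^{m} \ln\big(1 + e^{-y_i f_\lambda(x_i)}\big) + C_2 \|\lambda\|_2^2 + C_1 \min_{\{z_{i,j},\,y_{i,j}\}} \sum_{i=1}^{M} \sum_{j=1}^{M} d_{i,j} z_{i,j} \right) \quad \text{s.t.} \qquad (23)$$

constraints (15) to (21) hold, where $p(\tilde{x}_i) = \log\big(1 + e^{\lambda \cdot \tilde{x}_i}\big)$.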



If we have an algorithm for solving (22), then the same scheme can be used to solve (23). 



There are multiple ways of solving (or approximately solving) a mixed-integer nonlinear optimization problem of the form (22) or (23). We consider three methods here, described next. The first method is to directly use a generic mixed-integer nonlinear programming (MINLP) solver. The second and third methods (called Nelder-Mead and Alternating Minimization, denoted NM and AM respectively) are iterative schemes over the $\lambda$ parameter space. At every iteration of these algorithms, we need to evaluate the objective function. This evaluation involves solving the weighted TRP subproblem, as discussed in the previous subsection.

Method 1: MINLP Solver 



For our experiments we directly use a MINLP solver called Bonmin (Bonami et al., 2008). These types of solvers typically use general MILP solving techniques like branch and bound or dynamic programming interleaved with continuous optimization. Since the general MILP solving techniques, as discussed, can take exponential time when applied directly to our formulations, the MINLP solvers which use them can in turn be inefficient if the graph is moderate to large in size. However, when the graph is small, for instance when we want to schedule a tour over a small time period with only a few nodes, the MINLP solver can directly compute a solution to the problems (22) or (23) in manageable time.



Method 2: Nelder-Mead in $\lambda$-space (NM)

The Nelder-Mead minimization algorithm requires only function evaluations (Nelder and Mead, 1965). The ML&TRP can be viewed as a minimization in the space of all $\lambda$ vectors; since we have solvers for the weighted TRP subproblem, we are able to evaluate the ML&TRP objective for a given value of $\lambda$. In our experiments we use the MILP solver (Gurobi) for the subproblem. Note that the ML&TRP objective can have non-differentiable kinks arising from discontinuities in the graph traversal cost term; a method which relies on the gradient or the Hessian of the objective function might get stuck. By using only function values, NM may be able to bypass this type of situation. The generic Nelder-Mead scheme can have disadvantages with respect to performance (Rios, 2009). If so, other schemes like Multilevel Coordinated Search (MCS) (Huyer and Neumaier, 1999) can be used in place of Nelder-Mead. Note that since the objective is nonlinear, all solutions obtained by NM are only locally optimal.

Method 3: Alternating minimization in $\lambda$-$\pi$ space (AM)

Define Obj as follows:

$$\text{Obj}(\lambda, \pi) = \text{TrainingError}\big(f_\lambda, \{x_i, y_i\}_{i=1}^{m}\big) + C_1\, \text{GraphTraversalCost}\big(\pi, f_\lambda, \{\tilde{x}_i\}_{i=1}^{M}, \{d_{i,j}\}_{i,j=1}^{M}\big). \qquad (24)$$

We propose a heuristic minimization algorithm, where starting from an initial vector $\lambda_0$, Obj is minimized alternately with respect to $\pi$ and with respect to $\lambda$, as shown in Algorithm 1. The step that solves for $\pi$ is the same as solving the TRP subproblem, and we again use the MILP solver for this.

Algorithm 1 AM: Alternating minimization algorithm

Inputs: $\{x_i, y_i\}_{i=1}^{m}$, $\{\tilde{x}_i\}_{i=1}^{M}$, $\{d_{i,j}\}_{i,j}$, $C_1$, $C_2$, $T$ and initial vector $\lambda_0$.
for t = 1:T do
    Compute $\pi_t \in \text{argmin}_{\pi \in \Pi}\, \text{Obj}(\lambda_{t-1}, \pi)$.
    Compute $\lambda_t \in \text{argmin}_{\lambda \in \mathbb{R}^d}\, \text{Obj}(\lambda, \pi_t)$.
end for
Output: $\pi_T$.

Conditions for convergence and correctness for such iterative schemes are given by Csiszár and Tusnády (1984); only locally optimal solutions can be found using this method.
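The alternating scheme of Algorithm 1 can be sketched in a few lines for tiny instances. This is an illustrative sketch under stated assumptions, not the implementation used in the experiments: the route step enumerates all routes instead of calling a MILP solver, the $\lambda$ step uses a few gradient steps with step halving (so it only approximates the argmin), the $C_2$-regularized logistic loss is taken as the training error, Cost 1 is taken as the graph traversal cost, and the depot's latency is assumed to be the full tour length. All function names are hypothetical.

```python
import math
from itertools import permutations


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def latencies(route, dist):
    # Cumulative distance before reaching each node; depot latency = full tour.
    lat, travelled = {}, 0.0
    for prev, node in zip(route, route[1:]):
        travelled += dist[prev][node]
        lat[node] = travelled
    lat[route[0]] = travelled + dist[route[-1]][route[0]]
    return lat


def objective(lam, route, X, y, X_nodes, dist, C1, C2):
    # Regularized logistic loss plus C1 * (Cost 1 along `route`).
    train = sum(math.log(1.0 + math.exp(-yi * dot(lam, xi))) for xi, yi in zip(X, y))
    lat = latencies(route, dist)
    cost1 = sum(sigmoid(dot(lam, X_nodes[n])) * lat[n] for n in route)
    return train + C2 * dot(lam, lam) + C1 * cost1


def gradient(lam, route, X, y, X_nodes, dist, C1, C2):
    g = [2.0 * C2 * l for l in lam]
    for xi, yi in zip(X, y):
        coef = -yi * sigmoid(-yi * dot(lam, xi))
        g = [gj + coef * xj for gj, xj in zip(g, xi)]
    lat = latencies(route, dist)
    for n in route:
        p = sigmoid(dot(lam, X_nodes[n]))
        coef = C1 * lat[n] * p * (1.0 - p)
        g = [gj + coef * xj for gj, xj in zip(g, X_nodes[n])]
    return g


def best_route(lam, *data):
    # Route step: exact minimization over routes by enumeration (tiny M only).
    X_nodes = data[2]
    routes = [(0,) + p for p in permutations(range(1, len(X_nodes)))]
    return min(routes, key=lambda r: objective(lam, r, *data))


def minimize_lambda(lam, route, *data, steps=200, lr=0.1):
    # Lambda step: gradient descent that only accepts non-increasing moves.
    cur = objective(lam, route, *data)
    for _ in range(steps):
        g = gradient(lam, route, *data)
        trial = [l - lr * gj for l, gj in zip(lam, g)]
        val = objective(trial, route, *data)
        if val <= cur:
            lam, cur = trial, val
        else:
            lr *= 0.5
    return lam


def alternating_minimization(lam0, *data, T=5):
    lam, route = list(lam0), None
    for _ in range(T):
        route = best_route(lam, *data)
        lam = minimize_lambda(lam, route, *data)
    return lam, route
```

Because each step can only keep or lower Obj, the objective value is non-increasing over iterations, which matches the local (not global) optimality noted above.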



4. Experiments 

One of the major goals of this work is to be able to produce low cost solutions that are still 
high quality; these models can explain the variance in the training data, while promoting 
the prior belief that the cost will be low for carrying out the model's recommendations. 
The sequential formulation will not necessarily be able to accomplish this: the best possible 
minimizer of the training error will not necessarily yield a low cost solution. This point is 
made through some illustrations that follow next. We then evaluate the two formulations 
with respect to both accuracy on a test set and cost of the route, for a set of features derived 
from data from New York City's secondary electrical distribution network. 



4.1 Illustrations 

Our first illustration shows how a small change in the probabilities can give a completely different route and change the traversal cost. The graph G given by $\{d_{i,j}\}_{i,j}$ is shown in Figure 1. The number of unlabeled nodes is $M = 4$, with $\tilde{x}_1, \ldots, \tilde{x}_4 \in \mathbb{R}^2$ shown in Figure 2. The training features are also in the plane, $x_i \in \mathbb{R}^2$, and are represented by two gray circles in Figure 2 (for instance, the distributions could be Normal with diameters representing their corresponding variance). Note that it is important not to confuse the feature space of the $x_i$'s with the space that the graph $\{d_{i,j}\}_{i,j}$ is embedded in; these are different spaces, and the ML&TRP graph need not even have a physical distance interpretation.

Figure 1: Physical space for the four node illustration. The weights on the edges represent the distance $d_{i,j}$. The optimal route as determined by the sequential formulation is highlighted in (a); (b) shows a route determined by the simultaneous formulation.




Figure 2: Feature space for the four node illustration. 



The sequential formulation produces a function whose 0.5-probability level set is displayed (in feature space) as a black line in Figure 2. The route corresponding to that solution is given in Figure 1(a), which is $\pi^* = 1-3-2-4-1$. Consider instead what might happen if we used the simultaneous formulation. If we were to move the 0.5-probability level set slightly, for instance to the dashed line in Figure 2, the probability estimates on the finite training set would change only slightly, but the route would change entirely. The new route would be $\pi_{new} = 1-3-4-2-1$, and it would yield a lower value of Cost 1 (a decrease of approximately 16.4%). In both cases, the probability estimators may have very similar validation performance, so a solution from the simultaneous formulation might be preferred.

For our second illustration, we consider a new set of experiments with a similar setup as before but now with the number of nodes on the graph equal to 6. The training set was chosen uniformly at random from a distribution that is uniform over two triangles pointing end to end. Again the training data is finite, so that the level set can be moved, yielding almost the same probability estimates on the training set but accommodating lower-cost routes. Figure 3 shows the training instances and unlabeled instances in feature space along with two level sets. The first one, colored black and drawn at probability estimate 0.5, is learned from ($\ell_2$-regularized) logistic regression. The second level set, colored red and also drawn at probability estimate 0.5, is learned from the new simultaneous formulation, with the graph cost modeled according to Cost 1 (with an appropriately chosen coefficient $C_1$). Node 6 lies in a low density region of feature space, so its probability cannot be well estimated. In the sequential formulation, node 6 was assigned $p(\tilde{x}_6) = 0.5$. The optimal route thus obtained by solving the weighted TRP problem in the second step is 1-2-3-6-4-5-1, shown in Figure 4. In the simultaneous formulation, node 6 has been assigned a new probability value $p(\tilde{x}_6) = 0.29$. This big change is possible because its probability estimate can vary quite a lot without changing the probability estimates of the other nodes. This changes the route to 1-2-3-4-5-6-1, as shown in Figure 5. Here, we see that node 6 is also physically far from all other nodes. If it has a high enough probability estimate compared to nodes 4 and 5 (blue triangles in the lower left half of Figure 3), then a route that visits node 6 before visiting nodes 4 and 5 would be favored; this is what happens in the sequential formulation. In the simultaneous formulation, we chose $C_1$ large enough so that the tour visits nodes 4 and 5 before node 6. This results in an approximately 9% decrease in the route cost (Cost 1).

Using the data from the second illustration, the ($\ell_2$-regularized) training error is plotted in Figure 6(a). The axes are the first two coordinates of the $\lambda$ parameter vector. The optimal graph traversal cost (Cost 1) was computed for each value of $\lambda$ and is plotted in Figure 6(b); for each point, a weighted TRP subproblem was solved. The simultaneous ML&TRP objective is the sum of the values in Figures 6(a) and 6(b), and the constant $C_1$ controls how these surfaces are added together. If the training error term in Figure 6(a) is somewhat flat near the minimizer of the ML&TRP objective and the graph traversal term in Figure 6(b) is not flat, the graph traversal term may be able to have a substantial effect on the solution.



4.2 ML&TRP on the NYC power grid 

We now illustrate the performance of our method on a data set obtained from a collaborative effort with Con Edison, which is NYC's power utility company. More details about these data can be found in (Rudin et al., 2010). This dataset was developed in order to assist Con Edison with its maintenance and repair programs on the secondary electrical distribution network in NYC; specifically, it was designed for the purpose of predicting manhole fires and explosions. We chose to use all manholes from the Bronx (~23K manholes). Each manhole is represented by features that encode the number and type of electrical cables entering the manhole and the number and type of past events involving the manhole (e.g., if the manhole was the source of partial outages, full outages and/or underground burnouts). The training features encode events prior to 2008, and the training labels are 1 if the manhole was the source of a serious event (fire, explosion, smoke) during 2008. The nodes are 7 randomly chosen manholes, and the features for the nodes encode events prior to 2009. The prediction task is to predict events in 2009. The test set (for evaluating the performance of the predictive model) consists of features derived from the time period before 2009, and labels from 2009. Predicting manhole events is a difficult task for machine learning, because one cannot necessarily predict an event using the available data. The operational task was to design a route for a repair crew that is fixing the nodes. The choice of $M = 7$ nodes was made only to speed up the computation. The limit on the number of nodes for which the TRP problem can be solved efficiently directly limits the number of nodes which we can pick for the unlabeled set; this limit is an order of magnitude larger than the choice made here.

Figure 3: Plotting $\{x_i\}_{i=1}^{m}$ and $\{\tilde{x}_i\}_{i=1}^{M}$ in the feature space. Two level sets corresponding to 0.5 probability are also shown. The first (black) is obtained by minimizing the training error using the sequential method. The second (red) is the 0.5 probability level set obtained from the simultaneous formulation, and illustrates the effect of the graph traversal cost regularization term on the decision boundary.

Figure 4: The weights on the edges represent the distance $d_{i,j}$. The optimal route 1-2-3-6-4-5-1 as determined by the sequential formulation is highlighted. The route cost (Cost 1) is 4.7 units (scaled by $C_1 = 0.001$) and the training cost is 15.7 units. The values of $p(\tilde{x}_i)$ are shown in the node circles.

Figure 5: The optimal route 1-2-3-4-5-6-1 as determined by the simultaneous formulation is highlighted. The route cost (Cost 1) is now 4.25 units (scaled by $C_1 = 0.001$) and the training cost is 16.2 units.

Figure 6: The values of the two terms in the objective of the simultaneous ML&TRP formulation. (a) Training error as a function of $\{\lambda_1, \lambda_2\}$. The last coordinate, $\lambda_3$, is kept fixed. (b) Scaled optimal graph traversal cost (Cost 1 divided by 100) over a 2D grid of $\lambda_1$ and $\lambda_2$, again with $\lambda_3$ fixed.

The distances between the nodes were obtained from Google Maps, by querying the driving distance between each pair of nodes. Note that we do not want the 'flying' (straight-line) distance between two coordinates, as this can be very different from the actual driving distance. Manhole failures are rare events. This means that we have many more negative labels than positive labels. Because of the large class imbalance, using a logistic model gives us probability estimates which are low overall, so the misclassification error is almost always the size of the whole positive class. To avoid this, we chose to evaluate the quality of the predictions from $f_{\lambda^*}$ using the area under the ROC curve (AUC), for both training and test. The quality of the route is indicated by computing the optimal route cost at $\lambda^*$.
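As a reminder of why the AUC suits this setting, it depends only on how the scores rank positives against negatives, not on the magnitude of the scores. A minimal sketch of the pairwise (Mann-Whitney) form of the AUC follows; the function name `auc` and the toy scores are illustrative, and the quadratic loop is meant for small examples rather than for ~23K manholes.

```python
def auc(scores, labels):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 1/2.

    `labels` uses 1 for the positive class; any other value is negative.
    """
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l != 1]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


scores = [0.9, 0.8, 0.3, 0.1]
labels = [1, -1, 1, -1]
value = auc(scores, labels)
# Uniformly shrinking the probability estimates leaves the ranking, and the AUC, unchanged.
shrunk = auc([s / 10.0 for s in scores], labels)
```

This invariance is what allows all probability estimates to be low under class imbalance without the metric degenerating the way misclassification error does.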

Figures 7 and 8 show how the AUC values change with respect to the coefficient $C_1$ of the graph traversal cost term in the objectives of Cost 1 and Cost 2. The algorithms used here are the Nelder-Mead method with the MILP solver for the subproblem, and the alternating minimization method (AM), again with the MILP solver. Having the graph traversal cost as a regularizer lowers the AUC values of the predictor $f_{\lambda^*}$ on the training data, as expected. In this case, the performance on the test data is basically unchanged. We use a total of 4 features (that is, $x_i \in \mathbb{R}^4$). On the other hand, a related work on the same dataset (Rudin et al., 2011) uses more features and obtains about a 10% increase in the AUC values. The increase in training error for the simultaneous formulation (using Cost 1) as a function of $C_1$ is shown in Figure 9. The decrease in graph traversal cost as a function of $C_1$ is shown in Figure 10.

Figure 7: The AUC values corresponding to model parameters obtained from the simultaneous formulation using Cost 1 by the NM-MILP and AM-MILP algorithms along with the MINLP solver, plotted as a function of $C_1$. The AUC values on the training data decrease slightly and the same values for test data increase marginally. The two horizontal lines represent the training and test AUC values obtained by $\ell_2$-penalized logistic regression and thus are constant with respect to $C_1$.



The naive route obtained by estimating probabilities using $\ell_2$-penalized logistic regression, and then simply visiting nodes according to decreasing values of these probabilities, is shown in Figure 11. Figure 12 shows the route provided by the sequential formulation. For the simultaneous method, there are changes in the route as the coefficient $C_1$ increases. When $C_1$ is low, the route is the same as that obtained from the sequential method, in Figure 12. When the graph traversal term starts influencing the optimal solution of the objective (22) because of an increase in $C_1$, we get a new route, depicted in Figure 13.



We experimented with a large range of values for the regularization parameter $C_1$, with the goal of seeing a large range of possible results. We chose AUC as the evaluation metric, which is a measure of ranking quality; it is sensitive to the rank-ordering of the nodes in order of their probability to fail, and it is not as sensitive to changes in the values of these probabilities. As the parameter $C_1$ increases, the estimated probability values will tend to decrease, and thus the graph traversal cost will decrease; it may be possible for this to happen without impacting the prediction quality as measured by the AUC, but this depends on the routes and is not guaranteed. In our experiments, for both training and test we had a large sample (~23K examples). The test AUC values for the simultaneous method were all within 1% of the values obtained by the sequential method; this is true for both Cost 1 and Cost 2, for each of the AM, NM, and MINLP solvers. This means that the AUC prediction quality did not decrease as a result of using the new simultaneous method. The variation in training error across the methods was also small, about 2%. On the other hand, as expected, the graph traversal costs varied widely over the different methods and settings of $C_1$, as a result of the decrease in the probability estimates, as shown in Figure 10. There is a range of realistic probability estimates, but towards the right of the plot (for instance when $C_1 > 0.85$), the probability estimates are probably too low to be realistic and the costs are substantially underestimated. Let us compare the values of Cost 1 for our experiments in the more realistic range: Cost 1 of the naive method was 24.5% higher than that of the sequential method. As $C_1$ was increased from 0.05 to 0.5, Cost 1 went from 27.5 units to 3.2 units, which is over eight times smaller. This means that with a 1-2% variation in the predictive model's AUC, the graph traversal cost can decrease a lot, potentially yielding a more cost-effective route for inspection and/or repair work, by favoring the cost to be underestimated when there is uncertainty.

In most applications relevant to this problem, we suspect that the solution used in practice is somewhere in between the naive route and the sequential route, in that a human views the naive solution and adjusts it by hand to be closer to the sequential route (without solving the TRP). For the application to electrical grid maintenance, the simultaneous method was able to find a substantially lower cost route than the naive or sequential method, with little (if any) change in the AUC prediction quality.

Figure 8: The AUC values obtained from the simultaneous formulation, using Cost 2, from the NM-MILP and AM-MILP algorithms along with the MINLP solver, plotted as a function of $C_1$. Again, the training data AUC values decrease and the test data AUC values remain nearly constant. The horizontal lines represent constant values of the AUC obtained by $\ell_2$-penalized logistic regression.

Figure 9: The $\ell_2$-regularized logistic loss increases as a function of increasing $C_1$. The horizontal line represents the loss value from $\ell_2$-penalized logistic regression with no graph traversal cost regularization ($C_1 = 0$).

Figure 10: The graph traversal costs decrease as a function of the regularization parameter $C_1$. The horizontal lines in the figure represent the sequential formulation solutions; the lower horizontal line is Cost 1 of the solution obtained by $\ell_2$-penalized logistic regression, and the upper line is Cost 2 of that solution.

Figure 11: A naive route: 1-5-4-3-2-6-7-1, obtained by sorting the probability estimates in decreasing order and visiting the corresponding nodes.

Figure 12: Sequential formulation route: 1-5-3-4-2-6-7-1. The simultaneous formulation also chooses this route when $C_1$ is small.

Figure 13: Route chosen by the simultaneous formulation when $C_1$ is larger: 1-6-7-5-3-4-2-1. Prediction performance is only slightly influenced by the route change, but the cost of the route (Cost 1) decreases a lot.



5. Generalization Bound 

We initially introduced the graph traversal cost regularization term in order to find scenarios where the data would support low-cost (more actionable) repair routes. From another point of view, incorporating regularization reduces the size of the hypothesis space and may thus promote generalization. The size of the hypothesis space can be controlled using $C_1$. Increasing $C_1$ may thus assist in predicting failure probabilities on future inputs $x$ using $f_\lambda(x)$. Here, it is irrelevant whether a new instance $x$ is incorporated into a physical underlying graph; we aim only to estimate $P(y = 1 | x)$, where all $\{x_i, y_i\}_{i=1}^{m}$ and the new point $x, y$ are chosen independently at random from an unknown distribution $\mu_{\mathcal{X} \times \mathcal{Y}}$. In what follows, we will provide a generalization bound for the ML&TRP algorithm (22) with Cost 1, using an upper bound on Cost 1 to limit the size of the hypothesis space. A similar bound for (23) might be derived using the same method.

Generalization bounds are probabilistic guarantees that are useful for showing what variables may be important in the generalization process (Bousquet, 2003). The vast majority of works on generalization analysis are mainly interested in problems where the dimensionality $d$ of the input space (or feature space) is very large, leading to the "curse of dimensionality." In that case, various measures of the complexity of the hypothesis space are incorporated into the bounds; these complexity measures can often gauge the richness of a class of functions in a way that is independent of the input dimension $d$. Examples of such measures include the VC dimension for $\{0, 1\}$-valued function classes, the fat-shattering dimension for real-valued functions, Rademacher complexity, and certain kinds of covering numbers (Vapnik, 1998; Bartlett and Mendelson, 2002; Mendelson and Vershynin, 2003; Zhang, 2002; Shawe-Taylor and Cristianini, 2002; Kolmogorov and Tikhomirov, 1959). In the present work, we are instead interested in how the graph traversal cost influences generalization for a fixed $d$; that is, we are interested in how $C_1$ affects generalization, and not so much in the dependence on $d$. We make the assumption that all input features affect prediction ability. This means that our bound will depend on the dimensionality $d$ of the input space, and that there is no standard complexity measure that can reduce the complexity of the class of functions. Having this dependence on $d$ is not uncommon; for example, covering number bounds depending on the "Pollard dimension" (equal to the input dimension $d$ when finite) have been obtained for bounded real-valued functions (see Theorem 14.21 of Anthony and Bartlett, 1999; Theorem 6 of Haussler, 1992). There are also many bounds that rely directly on the number of elements within the hypothesis space, for finite hypothesis spaces.

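We are seeking to bound the true risk

$$R(f_\lambda) := E_{(x,y) \sim \mu_{\mathcal{X} \times \mathcal{Y}}}\, l_{f_\lambda}(x, y) = \int \ln\big(1 + e^{-y f_\lambda(x)}\big)\, d\mu_{\mathcal{X} \times \mathcal{Y}}(x, y),$$

where $l_{f_\lambda} : \mathcal{X} \times \mathcal{Y} \to \mathbb{R}$ is the logistic loss. We will bound $R(f_\lambda)$ by the empirical risk:

$$R(f_\lambda, \{x_i, y_i\}_1^m) = \frac{1}{m} \sum_{i=1}^{m} l_{f_\lambda}(x_i, y_i) = \frac{1}{m} \sum_{i=1}^{m} \ln\big(1 + e^{-y_i f_\lambda(x_i)}\big)$$

plus a complexity term. The complexity term takes into account the limitations on the hypothesis space of $f$, namely that $f \in \mathcal{F}$ where:

$$\mathcal{F} := \{f : f(x) = \lambda \cdot x \text{ for some } \lambda \in \mathbb{R}^d \text{ such that } \|\lambda\|_2 \le M_1\}.$$

Replacing the Lagrange multiplier $C_1$ in (22) with an explicit constraint, $f$ is subject to the graph traversal cost constraint:

$$\min_{\pi \in \Pi} \sum_{i=1}^{M} L_\pi(i)\, \frac{1}{1 + e^{-f(\tilde{x}_i)}} \le C_g,$$

where $C_g$ is a constant (inversely related to $C_1$).

We assume that the features are bounded, specifically, $x \in \mathcal{X} \subset \mathbb{R}^d$ with $\sup_{x \in \mathcal{X}} \|x\|_2 \le M_2$. We know $f_\lambda : \mathcal{X} \to [-M_1 M_2, M_1 M_2]$, since by the Cauchy-Schwarz inequality, for all $f_\lambda \in \mathcal{F}$ and all $x \in \mathcal{X}$, $|f_\lambda(x)| = |\lambda \cdot x| \le \|\lambda\|_2 \|x\|_2 \le M_1 M_2$.

Let us define the set of functions that are subject to a constraint on the graph traversal cost:

$$\mathcal{F}_0 := \left\{ f : f \in \mathcal{F},\; \min_{\pi \in \Pi} \sum_{i=1}^{M} L_\pi(i)\, \frac{1}{1 + e^{-f(\tilde{x}_i)}} \le C_g \right\},$$

where, recall, the latency $L_\pi(\pi(i))$ defined earlier is the latency of the node $\pi(i)$, which is the cumulative distance traveled on a tour before reaching $\pi(i)$. Our goal in this section will be to show that a bound on the complexity of the class $\mathcal{F}_0$ may assist generalization.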

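Define $d_i$ to be the shortest distance from the starting node to node $i$. Here, $d_1$ is the length of the shortest tour that visits all the nodes and returns to node 1. This means $d_i \le L_\pi(i)$ for every route $\pi$, and this inequality can be tight if the graph can be embedded into 1-dimensional Euclidean space (on a line). In what follows, we will fix a vector $c$, defined element-wise by:

$$c^j := \frac{\tilde{c}^j}{C_g - c_0},$$

where

$$\tilde{c}^j = \frac{e^{M_1 M_2}}{(1 + e^{M_1 M_2})^2} \sum_{i} d_i \tilde{x}_i^j$$

and

$$c_0 = \left( \frac{M_1 M_2\, e^{M_1 M_2}}{(1 + e^{M_1 M_2})^2} + \frac{1}{1 + e^{M_1 M_2}} \right) \sum_{i} d_i.$$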



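It will be important that the vector $c$ depends on $C_g$.
The main result follows from these definitions:

Theorem 1 (Main Result) Let $\mathcal{X} = \{x \in \mathbb{R}^d : \|x\|_2 \le M_2\}$, $\mathcal{Y} = \{-1, 1\}$. Let $\mathcal{F}_0$ be defined as above with respect to $\{\tilde{x}_i\}_{i=1}^{M}$, $\tilde{x}_i \in \mathcal{X}$ (not necessarily random). Let $\{x_i, y_i\}_{i=1}^{m}$ be a sequence of $m$ examples drawn independently according to an unknown distribution $\mu_{\mathcal{X} \times \mathcal{Y}}$. Then for any $\epsilon > 0$,

$$P\big(\exists f_\lambda \in \mathcal{F}_0 : |R(f_\lambda, \{x_i, y_i\}_1^m) - R(f_\lambda)| > \epsilon\big) \le 4\, \alpha(d, C_g, c) \left( \frac{32 M_1 M_2}{\epsilon} + 1 \right)^{d} \exp\left( \frac{-m \epsilon^2}{512 (M_1 M_2)^2} \right),$$

where

$$\alpha(d, C_g, c) := \frac{1}{2} + \frac{\|c\|_2^{-1} + \frac{\epsilon}{32 M_2}}{M_1 + \frac{\epsilon}{32 M_2}} \cdot \frac{\Gamma\left[1 + \frac{d}{2}\right]}{\sqrt{\pi}\, \Gamma\left[\frac{d+1}{2}\right]} \cdot {}_2F_1\left( \frac{1}{2}, \frac{1-d}{2}; \frac{3}{2}; \left( \frac{\|c\|_2^{-1} + \frac{\epsilon}{32 M_2}}{M_1 + \frac{\epsilon}{32 M_2}} \right)^{2} \right) \qquad (25)$$

or equivalently

$$\alpha(d, C_g, c) := 1 - \frac{1}{2}\, I_{1 - \left( \frac{\|c\|_2^{-1} + \frac{\epsilon}{32 M_2}}{M_1 + \frac{\epsilon}{32 M_2}} \right)^{2}} \left( \frac{d+1}{2}, \frac{1}{2} \right) \qquad (26)$$

and where ${}_2F_1(a, b; c; d)$ and $I_x(a, b)$ are the hypergeometric function and the regularized incomplete beta function respectively.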

The term $\alpha(d, C_g, c)$ comes directly from formulas for the volumes of spherical caps. Our goal was to establish that generalization can depend on $C_g$. The value of $C_g$ enters into the bound through the vector $c$. As $C_g$ decreases, the norm $\|c\|_2$ increases, and thus $\|c\|_2^{-1}$ decreases, (26) and (25) decrease, and the whole bound decreases. This indicates that decreasing $C_g$ may improve generalization ability.
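The monotone dependence of $\alpha$ on $\|c\|_2$ can be checked numerically. The sketch below assumes, following the spherical cap interpretation and the visible terms of (25)-(26), that $\alpha$ is the fraction of a $d$-dimensional ball of radius $M_1 + \epsilon/(32 M_2)$ that lies on the origin side of a hyperplane at distance $\|c\|_2^{-1} + \epsilon/(32 M_2)$ from the center. It evaluates that fraction by one-dimensional numerical integration over slices rather than through the hypergeometric or incomplete beta forms; the function names are hypothetical.

```python
def ball_fraction_below(h, R, d, n=20000):
    """Fraction of a d-ball of radius R lying in the half space {v : u.v <= h}, |u| = 1.

    A slice at height t is a (d-1)-ball of radius sqrt(R^2 - t^2), so its volume
    is proportional to (R^2 - t^2)^((d-1)/2). Midpoint rule with n cells.
    """
    def integral(a, b):
        step = (b - a) / n
        total = 0.0
        for k in range(n):
            t = a + (k + 0.5) * step
            total += max(R * R - t * t, 0.0) ** ((d - 1) / 2.0)
        return total * step

    h = max(-R, min(h, R))  # a hyperplane outside the ball cuts nothing off
    return integral(-R, h) / integral(-R, R)


def alpha(d, c_norm, M1, M2, eps):
    """Assumed reading of (25)-(26): padded cap fraction in lambda-space."""
    pad = eps / (32.0 * M2)
    return ball_fraction_below(1.0 / c_norm + pad, M1 + pad, d)


# A larger ||c||_2 (smaller C_g) moves the hyperplane toward the origin and shrinks alpha.
loose = alpha(3, 1.25, 1.0, 1.0, 0.1)
tight = alpha(3, 2.0, 1.0, 1.0, 0.1)
```

In one dimension the fraction reduces to $(h + R) / (2R)$, which gives a quick sanity check of the integration.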

We will provide several lemmas leading to the proof of the theorem. The proof idea is to enlarge the class of functions just enough so that a bound on the covering number (the number of balls required to cover the set $\mathcal{F}_0$) can be constructed. The class $\mathcal{F}$ corresponds to a ball of radius $M_1$ in $\lambda$-space (a ball in $\mathbb{R}^d$). The class $\mathcal{F}_0$ corresponds to a subset of that class. We will construct two classes, $\mathcal{F}_1$ and $\mathcal{F}_2$, that are slightly larger than $\mathcal{F}_0$, but smaller than $\mathcal{F}$ when $C_g$ is small enough. Then we will use a volumetric argument to bound the covering number of $\mathcal{F}_2$, which uses the volumes of spherical caps. The idea is to show that the value of $C_g$ affects the volume of the hypothesis space, and thus the covering number. The covering number bound is then applied to a uniform bound of Pollard (1984). The fact that the covering number of $\mathcal{F}_2$ can be below that of $\mathcal{F}$ indicates that using functions from $\mathcal{F}_2$ may provide improvements in generalization over the set $\mathcal{F}$. We now proceed with the proof.

We define the class $\mathcal{F}_1$ using a lower bound on the latencies $L_\pi(i)$, namely the minimum distances $d_i$. We have, for any values of $p(\tilde{x}_i) \ge 0$:

$$\sum_{i} d_i\, p(\tilde{x}_i) \le \sum_{i} L_\pi(i)\, p(\tilde{x}_i) \le C_g.$$

This means that the class of functions whose probabilities obey $\sum_i d_i\, p(\tilde{x}_i) \le C_g$ is larger than the class obeying $\sum_i L_\pi(i)\, p(\tilde{x}_i) \le C_g$. That is, $\mathcal{F}_0 \subseteq \mathcal{F}_1$ where

$$\mathcal{F}_1 := \left\{ f : f \in \mathcal{F},\; \sum_{i=1}^{M} d_i\, \frac{1}{1 + e^{-f(\tilde{x}_i)}} \le C_g \right\}.$$

As long as $C_g < \sum_{i=1}^{M} d_i$, the constraint in $\mathcal{F}_1$ is not vacuous.

Let $B_{M_1} = \{\lambda : \|\lambda\|_2 \le M_1\}$ be a closed ball of radius $M_1$ in $\mathbb{R}^d$. By definition, the set $B_{M_1}$ is the set of $\lambda$ used for constructing functions in $\mathcal{F}$.

The space of functions $\mathcal{F}_2$ is defined with respect to the vector $c$ (defined above the theorem) as follows:

$$\mathcal{F}_2 := \{f_\lambda : f_\lambda \in \mathcal{F},\; c \cdot \lambda \le 1\}.$$

The choice of $c$ ensures that $\mathcal{F}_1$ is a subset of $\mathcal{F}_2$, as we will prove below. The half space corresponding to $\mathcal{F}_2$ is

$$H_c := \{\lambda : c \cdot \lambda \le 1\}.$$

The value $\|c\|_2^{-1}$ will be used in the volumetric argument.
We provide some common definitions.

Definition 2 Let $A \subseteq X$ be an arbitrary set and $(X, dist)$ a (pseudo) metric space. Let $|\cdot|$ denote set size.

• For any $\epsilon > 0$, an $\epsilon$-cover for $A$ is a finite set $U \subseteq X$ (not necessarily $\subseteq A$) s.t. $\forall x \in A, \exists u \in U$ with $dist(x, u) \le \epsilon$.

• $A$ is totally bounded if $A$ has a finite $\epsilon$-cover for all $\epsilon > 0$. The covering number of $A$ is $N(\epsilon, A, dist) := \inf_U |U|$ where $U$ is an $\epsilon$-cover for $A$.

• A set $R \subseteq X$ is $\epsilon$-separated if $\forall x, y \in R$ with $x \ne y$, $dist(x, y) > \epsilon$. The packing number is $M(\epsilon, A, dist) := \sup_{R : R \subseteq A} |R|$ where $R$ is $\epsilon$-separated.

There is a well-known relationship between packing numbers and covering numbers which we will make use of in proving Theorem 7.

Lemma 3 (Packing and covering numbers) For every (pseudo) metric space $(X, dist)$, $A \subseteq X$, and $\epsilon > 0$,

$$N(\epsilon, A, dist) \le M(\epsilon, A, dist).$$

Proof See Theorem 4 in Kolmogorov and Tikhomirov (1959) or Theorem 12.1 in Anthony and Bartlett (1999) for a proof of this classical result. ■
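The intuition behind Lemma 3 is that a maximal $\epsilon$-separated subset of $A$ is automatically an $\epsilon$-cover of $A$: any point left out must lie within $\epsilon$ of a chosen point, otherwise it could have been added. A small sketch of this argument on a finite point set in the plane follows; the greedy construction and function names are illustrative and are not taken from the cited proofs.

```python
import math


def euclid(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def greedy_separated_subset(points, eps):
    """Greedily build a maximal eps-separated subset of `points`."""
    chosen = []
    for p in points:
        if all(euclid(p, q) > eps for q in chosen):
            chosen.append(p)
    return chosen


def is_separated(subset, eps):
    return all(euclid(p, q) > eps
               for i, p in enumerate(subset) for q in subset[i + 1:])


def is_cover(centers, points, eps):
    """True if every point is within eps of some center."""
    return all(any(euclid(p, c) <= eps for c in centers) for p in points)


# A 11 x 11 grid on the unit square plays the role of the set A.
A = [(i / 10.0, j / 10.0) for i in range(11) for j in range(11)]
eps = 0.25
R = greedy_separated_subset(A, eps)
```

Since `R` is an $\epsilon$-cover, the covering number is at most `len(R)`, and since `R` is $\epsilon$-separated, `len(R)` is at most the packing number, which is the chain of inequalities in the lemma.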



Let $\mu_{\mathcal{B}}$ represent a probability measure on a set $\mathcal{B}$. Let $B$ be a random variable taking values in $\mathcal{B}$ according to $\mu_{\mathcal{B}}$, and $b$ be a realization of $B$. Let $\mu_{\mathcal{B}}^m$ represent the empirical measure based on the sample $b^m = \{b_1, \ldots, b_m\}$. Let $L_2(\mu_{\mathcal{B}})$ be a space of functions defined on the set $\mathcal{B}$ with the metric $\|f - g\|_{L_2(\mu_{\mathcal{B}})} = \left( \int (f(b) - g(b))^2\, d\mu_{\mathcal{B}} \right)^{1/2}$. In what follows, we will define $\mathcal{B}$ to be the input space $\mathcal{X}$ or the joint input-output space $\mathcal{X} \times \mathcal{Y}$. Also the squared $\ell_2$ distance will continue to be denoted $\sum_j (\lambda_1^j - \lambda_2^j)^2 = \|\lambda_1 - \lambda_2\|_2^2$.

Lemma 4 (Relating covering numbers in $\|\cdot\|_{L_2(\mu_X^m)}$ to $\|\cdot\|_2$)

a. $\sup_{\mu_X^m} N(\epsilon, \mathcal{F}, \|\cdot\|_{L_2(\mu_X^m)}) \le N(\epsilon / M_2, B_{M_1}, \|\cdot\|_2)$

b. $\sup_{\mu_X^m} N(\epsilon, \mathcal{F}_2, \|\cdot\|_{L_2(\mu_X^m)}) \le N(\epsilon / M_2, B_{M_1} \cap H_c, \|\cdot\|_2)$.

Lemma 4 will be used to tie together two results in the proof of Theorem 1, in order to relate the covering number of a class of functions to the covering number of a subset of $\lambda$-space in $\mathbb{R}^d$. The first result is Lemma 5, which shows how the covering numbers of the different function classes of interest are related. The second result is Theorem 7, which shows how the covering number of subsets of $\mathbb{R}^d$ can be bounded.

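Proof Each element $f \in \mathcal{F}$ corresponds to at least one element of $B_{M_1}$ by definition of
$\mathcal{F}$. Choose any distribution $\mu_X^m$. Consider two elements $\lambda_f, \lambda_g \in B_{M_1}$ corresponding to
functions $f, g \in \mathcal{F} \subset L_2(\mu_X^m)$. Then,

$$
\begin{aligned}
\|f - g\|_{L_2(\mu_X^m)}^2 &= \frac{1}{m}\sum_{i=1}^m \big(f(x_i) - g(x_i)\big)^2
 = \frac{1}{m}\sum_{i=1}^m \big((\lambda_f - \lambda_g)\cdot x_i\big)^2 \\
&\le \frac{1}{m}\sum_{i=1}^m \|\lambda_f - \lambda_g\|_2^2 \, \|x_i\|_2^2 \quad \text{(Cauchy-Schwarz to each term)} \\
&\le \|\lambda_f - \lambda_g\|_2^2 \left(\frac{1}{m}\sum_{i=1}^m M_2^2\right) \quad \text{(since } \sup \|x\|_2 \le M_2\text{)} \\
&= M_2^2 \, \|\lambda_f - \lambda_g\|_2^2 .
\end{aligned}
$$

Consider a minimal $\epsilon/M_2$-cover $\{\lambda_r\}_r$ for $B_{M_1}$, where $\lambda_r$ corresponds to function $r \in \mathcal{F}$.
Then by definition, $\forall \lambda \in B_{M_1}, \exists \lambda_r : \|\lambda - \lambda_r\|_2 \le \epsilon/M_2$. Thus, for any element
$\lambda_f$ in the ball of radius $\epsilon/M_2$ around $\lambda_r$, the corresponding function $f$ belongs to the ball of
radius $\epsilon$ around $r$, measured using distance in $L_2(\mu_X^m)$, by the inequality above.
The centers of these $\epsilon$-balls in $L_2(\mu_X^m)$ form an $\epsilon$-cover for $\mathcal{F}$. The size of this set is equal to
$N(\epsilon/M_2, B_{M_1}, \|\cdot\|_2)$. The size of the minimal $\epsilon$-cover of $\mathcal{F}$ is less than or equal to this size:
$N(\epsilon, \mathcal{F}, \|\cdot\|_{L_2(\mu_X^m)}) \le N(\epsilon/M_2, B_{M_1}, \|\cdot\|_2)$. Taking a supremum over all $\mu_X^m$, we obtain the
first inequality of the Lemma. The same argument also works for the second inequality. ■

We will upper bound the covering number for $\mathcal{F}_1$ with the covering number for the more
tractable $\mathcal{F}_2$. We need to derive the vector $c$ in such a way that $\mathcal{F}_2$ is a larger class than
$\mathcal{F}_1$.

Lemma 5 ($\mathcal{F}_0$ is contained in $\mathcal{F}_2$)

$N(\epsilon, \mathcal{F}_0, \|\cdot\|_{L_2(\mu_X^m)}) \le N(\epsilon, \mathcal{F}_1, \|\cdot\|_{L_2(\mu_X^m)}) \le N(\epsilon, \mathcal{F}_2, \|\cdot\|_{L_2(\mu_X^m)})$.

Proof It is sufficient to show $\mathcal{F}_0 \subseteq \mathcal{F}_1 \subseteq \mathcal{F}_2$. The first inclusion was discussed earlier;
since $d_i \le \inf_{\pi \in \Pi} L_\pi(i)$, this implies:

$$\sum_{i=1}^M d_i \, p(x_i) \le \sum_{i=1}^M L_\pi(i) \, p(x_i).$$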
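We now show $\mathcal{F}_1 \subseteq \mathcal{F}_2$. We will show how $\tilde{c}$ and $\tilde{c}_0$ were derived so that:

$$\tilde{c} \cdot \lambda + \tilde{c}_0 \le \sum_{i=1}^M d_i \, p(x_i) \le \sum_{i=1}^M L_\pi(i) \, p(x_i) \le C_g,$$

which will allow us to say that the set of $\lambda$ such that $\tilde{c} \cdot \lambda + \tilde{c}_0 \le C_g$ is larger than $\mathcal{F}_1$; this
set is $\mathcal{F}_2$. We will find $\tilde{c}$ and $\tilde{c}_0$ by finding $m_1$ and $m_0$ such that for any $i$,

$$m_1 (\lambda \cdot x_i) + m_0 \le p(x_i), \qquad (27)$$

so that $\tilde{c}$ and $\tilde{c}_0$ will be defined with respect to $m_0$ and $m_1$ by:

$$\sum_i d_i \, p(x_i) \ge \sum_i d_i \big(m_1 (\lambda \cdot x_i) + m_0\big) = m_1 \Big(\sum_i d_i x_i\Big) \cdot \lambda + m_0 \sum_i d_i =: \tilde{c} \cdot \lambda + \tilde{c}_0. \qquad (28)$$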



Let us now define $m_1$ and $m_0$ in order to obey (27). The condition in (27) is:

$$m_1 f(x_i) + m_0 \le \frac{1}{1 + e^{-f(x_i)}}.$$

Within the range $[-M_1 M_2, M_1 M_2]$ we lower bound the function $g(z) = 1/(1 + e^{-z})$ by the
line with slope

$$m_1 = g'(-M_1 M_2) = \frac{e^{M_1 M_2}}{(1 + e^{M_1 M_2})^2}$$

that intersects the point $(-M_1 M_2, g(-M_1 M_2))$, and thus has y-intercept

$$m_0 = M_1 M_2 \frac{e^{M_1 M_2}}{(1 + e^{M_1 M_2})^2} + \frac{1}{1 + e^{M_1 M_2}}.$$



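Incorporating this into (28), we have $\tilde{c} \cdot \lambda + \tilde{c}_0 \le \sum_i d_i \, p(x_i) \le C_g$, where $\tilde{c}$ is defined
element-wise by

$$\tilde{c}^j = m_1 \Big(\sum_i d_i x_i^j\Big) = \frac{e^{M_1 M_2}}{(1 + e^{M_1 M_2})^2} \sum_i d_i x_i^j$$

and

$$\tilde{c}_0 = m_0 \sum_i d_i = \left[ M_1 M_2 \frac{e^{M_1 M_2}}{(1 + e^{M_1 M_2})^2} + \frac{1}{1 + e^{M_1 M_2}} \right] \sum_i d_i.$$

Finally, we obtain the vector $c$ from $\tilde{c}$ and $\tilde{c}_0$.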



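$$
\begin{aligned}
\{\lambda : \tilde{c} \cdot \lambda + \tilde{c}_0 \le C_g\} &= \{\lambda : \tilde{c} \cdot \lambda \le C_g - \tilde{c}_0\} \\
&= \Big\{\lambda : \frac{\tilde{c}}{C_g - \tilde{c}_0} \cdot \lambda \le 1\Big\} \\
&=: \{\lambda : c \cdot \lambda \le 1\},
\end{aligned}
$$

where $c$ is defined element-wise by

$$c^j = \frac{\tilde{c}^j}{C_g - \tilde{c}_0}.$$

This is the same definition of $c$ used in the theorem and for $\mathcal{F}_2$. Thus, $\mathcal{F}_1 \subseteq \mathcal{F}_2$. ■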
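The construction in the proof of Lemma 5 can be sketched numerically. The code below is a minimal illustration under stated assumptions, not the authors' implementation. It computes the tangent-line coefficients $m_1, m_0$, the intermediate pair (named `c_tilde` and `c_tilde_0` here) and the final vector $c$. The function names, the argument layout, and the explicit check that $C_g$ exceeds the intermediate offset are assumptions introduced for illustration. The source does not state that positivity condition, but the division requires it.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def tangent_line_coefficients(M1, M2):
    """Slope and intercept of the tangent to g(z) = 1/(1+e^{-z}) at z = -M1*M2."""
    a = M1 * M2
    m1 = math.exp(a) / (1.0 + math.exp(a)) ** 2  # g'(-a)
    m0 = a * m1 + 1.0 / (1.0 + math.exp(a))      # y-intercept of the tangent line
    return m1, m0


def halfspace_vector(d_weights, xs, M1, M2, C_g):
    """Return (c, c_tilde, c_tilde_0) such that the constraint set lies in {lam : c . lam <= 1}."""
    m1, m0 = tangent_line_coefficients(M1, M2)
    dim = len(xs[0])
    c_tilde = [m1 * sum(d * x[j] for d, x in zip(d_weights, xs)) for j in range(dim)]
    c_tilde_0 = m0 * sum(d_weights)
    gap = C_g - c_tilde_0
    if gap <= 0:
        # Assumption made explicit here: the rescaling only makes sense for a positive gap.
        raise ValueError("C_g must exceed c_tilde_0 for the rescaling to be valid")
    c = [cj / gap for cj in c_tilde]
    return c, c_tilde, c_tilde_0
```

For instance, with hypothetical weights `d_weights = [1.0, 2.0]`, points `xs = [[1.0, 0.0], [0.0, 1.0]]`, `M1 = M2 = 1.0` and `C_g = 1.6`, every $\lambda$ in the unit ball whose cost $\sum_i d_i\, p(x_i)$ is at most $C_g$ also satisfies $c \cdot \lambda \le 1$.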

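Another way to obtain a suitable $c$ such that $\mathcal{F}_2$ is a good superset of $\mathcal{F}_1$ is to minimize
the distance of the hyperplane we want to construct from the origin by solving the semi-infinite
program (29):

$$
\begin{aligned}
\max_{c} \quad & \|c\|_2 \\
\text{s.t.} \quad & \forall \lambda \in \Lambda \cup \Lambda^0, \; c \cdot \lambda \le 1, \\
\text{where} \quad & \Lambda = \Big\{\lambda : \|\lambda\|_2 \le M_1, \; \sum_{i=1}^M \frac{d_i}{1 + e^{-\lambda \cdot x_i}} = C_g \Big\} \qquad (29) \\
\text{and} \quad & \Lambda^0 = \Big\{\lambda : \|\lambda\|_2 = M_1, \; \sum_{i=1}^M \frac{d_i}{1 + e^{-\lambda \cdot x_i}} \le C_g \Big\}.
\end{aligned}
$$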



One can approximate the two sets $\Lambda$ and $\Lambda^0$ in the program formulation by discretizing
the points on them, and the semi-infinite program becomes a non-linear program. Figure 14
provides a 2-dimensional illustration of $\mathcal{F}, \mathcal{F}_1, \mathcal{F}_2$ and the approximate solution of the
semi-infinite program. The semi-infinite program yields a tighter bound on $\mathcal{F}_1$ but is more
expensive to compute.

Because of the rotational symmetry of $B_{M_1}$, the volume cut off by a hyperplane $c \cdot \lambda = 1$ from
$B_{M_1}$ is determined only by its distance from the origin, which is $1/\|c\|_2$. Such a portion
(or its complement, if smaller) of a ball obtained from slicing it with a hyperplane is called
a spherical cap. It can be parameterized by the distance of its (hyper)plane base from the
center of the ball.

Let the volume of a set $A \subset \mathbb{R}^d$ be represented as $\mathrm{Vol}(A)$. For example, $\mathrm{Vol}(B_1) = \frac{\pi^{d/2}}{\Gamma[d/2 + 1]}$.

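Lemma 6 (Volume of spherical caps) Let the volume of the ball $B_{M_1}$ in $\mathbb{R}^d$ be denoted as
$\mathrm{Vol}(B_{M_1})$. Let $H_z = \{\lambda : c \cdot \lambda \le 1, \|c\|_2^{-1} = z\}$ be a half space parameterized by $z$. Let
the spherical cap be denoted by $B_{M_1} \cap H_z^c$, where the cap is at a distance $z$ (measured from
the base of the cap to the center of the ball), and $H_z^c$ represents the complement half space
($H_z \cup H_z^c = \mathbb{R}^d$). Then,

$$\mathrm{Vol}(B_{M_1} \cap H_z^c) = \frac{1}{2}\mathrm{Vol}(B_{M_1}) - \frac{\pi^{(d-1)/2} M_1^{d-1} z}{\Gamma[(d+1)/2]} \; {}_2F_1\!\left(\frac{1}{2}, \frac{1-d}{2}; \frac{3}{2}; \Big(\frac{z}{M_1}\Big)^2\right),$$

where ${}_2F_1(a, b; c; d)$ is the hypergeometric function. Alternatively,

$$\mathrm{Vol}(B_{M_1} \cap H_z^c) = \frac{1}{2}\mathrm{Vol}(B_{M_1}) \, I_{1 - z^2/M_1^2}\!\left(\frac{d+1}{2}, \frac{1}{2}\right),$$

where $I_x(e, f)$ is the regularized incomplete beta function.
Proof See Li (2011) and references therein. ■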
Figure 15: Normalized volume of a unit $\ell_2$-ball intersected with a halfspace (the volume on one
side of the hyperplane, normalized by $\mathrm{Vol}(B_1)$) as a function of $z$, the distance of the
hyperplane $H_z$ from the center of the ball. This also illustrates the dependence of
$\alpha(d, C_g, c)$ on $C_g$.



Note that for $0 \le z \le M_1$, $\mathrm{Vol}(B_{M_1} \cap H_z^c) \le \mathrm{Vol}(B_{M_1})$. If $\mathrm{Vol}(B_{M_1} \cap H_z^c) < \frac{1}{2}\mathrm{Vol}(B_{M_1})$,
then the volume of the spherical cap, relative to the volume of the ball, reduces with increasing
dimension $d$. This has an important effect on the covering number as a function of dimension $d$,
and ultimately on generalization too. Figure 15 illustrates this point by showing the volume on
one side of the hyperplane as the hyperplane moves through the ball, for various values of the
dimension $d$. If $d$ is fairly large, then the volume decreases dramatically as the hyperplane
passes through the center of the ball.
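The dependence on dimension described above can be checked with a short numerical sketch. Rather than evaluating the hypergeometric or incomplete beta forms of Lemma 6, the code below integrates the $(d-1)$-dimensional cross-sections of the ball with Simpson's rule. This is a swapped-in technique chosen so that only the standard library is needed. The function names and the step count are illustrative assumptions.

```python
import math


def unit_ball_volume(d):
    """Volume of the unit Euclidean ball in R^d: pi^(d/2) / Gamma(d/2 + 1)."""
    return math.pi ** (d / 2.0) / math.gamma(d / 2.0 + 1.0)


def cap_volume(d, radius, z, steps=2000):
    """Volume of the cap {lam : ||lam||_2 <= radius, lam_1 >= z} for 0 <= z <= radius, d >= 2.

    Integrates the (d-1)-dimensional slice volumes over the first coordinate
    with composite Simpson's rule.
    """
    if d < 2 or not (0.0 <= z <= radius):
        raise ValueError("requires d >= 2 and 0 <= z <= radius")
    if steps % 2:
        steps += 1  # Simpson's rule needs an even number of intervals
    h = (radius - z) / steps

    def slice_volume(t):
        r2 = max(radius * radius - t * t, 0.0)
        return unit_ball_volume(d - 1) * r2 ** ((d - 1) / 2.0)

    total = slice_volume(z) + slice_volume(radius)
    for k in range(1, steps):
        total += (4 if k % 2 else 2) * slice_volume(z + k * h)
    return total * h / 3.0


def normalized_cap_volume(d, radius, z):
    """Cap volume as a fraction of the volume of the whole ball."""
    return cap_volume(d, radius, z) / (unit_ball_volume(d) * radius ** d)
```

Evaluating `normalized_cap_volume(d, 1.0, 0.3)` for increasing `d` gives a decreasing sequence, consistent with the behavior described for Figure 15. At `z = 0` the fraction is one half in every dimension.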

We now use the volume of the spherical cap in Lemma 6 to bound the covering numbers
of subsets of $\mathbb{R}^d$, noting that the relationship between the spherical cap and its complement
is: $\mathrm{Vol}(B_{M_1} \cap H_z) = \mathrm{Vol}(B_{M_1}) - \mathrm{Vol}(B_{M_1} \cap H_z^c)$.

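Theorem 7 (Bound on Covering Numbers)

$$N(\epsilon/M_2, B_{M_1}, \|\cdot\|_2) \le \left(\frac{2 M_1 M_2}{\epsilon} + 1\right)^d,$$

$$N(\epsilon/M_2, B_{M_1} \cap H_{\|c\|_2^{-1}}, \|\cdot\|_2) \le \frac{\mathrm{Vol}\big(B_{M_1 + \frac{\epsilon}{2 M_2}} \cap H_{\|c\|_2^{-1} + \frac{\epsilon}{2 M_2}}\big)}{\mathrm{Vol}\big(B_{M_1 + \frac{\epsilon}{2 M_2}}\big)} \left(\frac{2 M_1 M_2}{\epsilon} + 1\right)^d.$$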
Proof Both statements involve a volumetric argument. There are various versions of the
proof for the first part. For example, see Section 3 of Kolmogorov and Tikhomirov (1959),
Lemma 4.10 in Pisier (1989), Lorentz (1966) and Lemma 3 in Cucker and Smale (2002),
among others. We will provide an argument along these lines. Let $\lambda_1, \ldots, \lambda_M$ be an optimal
$\epsilon$-packing for $B_{M_1}$. That is, $M = M(\epsilon, B_{M_1}, \|\cdot\|_2)$. The volume of $B_{M_1 + \epsilon/2}$ (an extra $\epsilon/2$
added so that the packing elements can lie within the boundary) is

$$\mathrm{Vol}(B_{M_1 + \epsilon/2}) = \mathrm{Vol}(B_1)(M_1 + \epsilon/2)^d,$$

where $\mathrm{Vol}(B_1)$ is the volume of a unit ball in dimension $d$. The volume of an $\epsilon/2$ ball with
packing element $\lambda_j$ as the center is:

$$\mathrm{Vol}(B_{\epsilon/2} + \lambda_j) = \mathrm{Vol}(B_{\epsilon/2}) = \mathrm{Vol}(B_1)(\epsilon/2)^d.$$

Since the sum of the volumes of the $\epsilon/2$ balls should be less than or equal to the volume of
the extended ball $B_{M_1 + \epsilon/2}$ (else one of the packing elements $\lambda_j$ will be outside the boundary
of $B_{M_1}$, contradicting the definition of packing) we have:

$$M(\epsilon, B_{M_1}, \|\cdot\|_2)\, \mathrm{Vol}(B_1)(\epsilon/2)^d \le \mathrm{Vol}(B_1)(M_1 + \epsilon/2)^d.$$

Scaling $\epsilon$ to $\epsilon/M_2$ and using the inequality between minimal covering and maximal packing
numbers from Lemma 3, we obtain the first stated result.

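To show the second part, let the volume of the complement of the spherical cap be
$\mathrm{Vol}(B_{M_1} \cap H_{\|c\|_2^{-1}})$; we need to find an upper bound for the minimal $\epsilon/M_2$-cover of this set.
We can do that by scaling a minimal $\epsilon$-cover, which we find now. By extending the boundary
of $B_{M_1} \cap H_{\|c\|_2^{-1}}$ by $\epsilon/2$, we can bound the maximal packing number $M(\epsilon, B_{M_1} \cap H_{\|c\|_2^{-1}}, \|\cdot\|_2)$
as follows.

$$
\begin{aligned}
M(\epsilon, B_{M_1} \cap H_{\|c\|_2^{-1}}, \|\cdot\|_2)\, \mathrm{Vol}(B_1)(\epsilon/2)^d &\le \mathrm{Vol}\big(B_{M_1 + \epsilon/2} \cap H_{\|c\|_2^{-1} + \epsilon/2}\big), \\
M(\epsilon, B_{M_1} \cap H_{\|c\|_2^{-1}}, \|\cdot\|_2) &\le \frac{\mathrm{Vol}\big(B_{M_1 + \epsilon/2} \cap H_{\|c\|_2^{-1} + \epsilon/2}\big)}{\mathrm{Vol}(B_1)} \frac{1}{(\epsilon/2)^d} \\
&= \frac{\mathrm{Vol}\big(B_{M_1 + \epsilon/2} \cap H_{\|c\|_2^{-1} + \epsilon/2}\big)}{\mathrm{Vol}(B_1)(M_1 + \epsilon/2)^d} \frac{(M_1 + \epsilon/2)^d}{(\epsilon/2)^d} \\
&= \frac{\mathrm{Vol}\big(B_{M_1 + \epsilon/2} \cap H_{\|c\|_2^{-1} + \epsilon/2}\big)}{\mathrm{Vol}(B_{M_1 + \epsilon/2})} \frac{(M_1 + \epsilon/2)^d}{(\epsilon/2)^d}.
\end{aligned}
$$

Again, scaling $\epsilon$ to $\epsilon/M_2$ and using the relationship between $N(\epsilon, A, \mathrm{dist})$ and $M(\epsilon, A, \mathrm{dist})$
in Lemma 3 yields the second result. ■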
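The scaling step used at the end of both parts of the proof can be written out explicitly. The worked step below is a sketch that assumes Lemma 3 (not reproduced in this section) is the standard comparison $N(\epsilon, A, \mathrm{dist}) \le M(\epsilon, A, \mathrm{dist})$ between minimal covering and maximal packing numbers.

```latex
\begin{align*}
% Packing bound from the volume comparison, for the full ball:
M(\epsilon, B_{M_1}, \|\cdot\|_2)
  &\le \frac{(M_1 + \epsilon/2)^d}{(\epsilon/2)^d}
   = \left(\frac{2 M_1}{\epsilon} + 1\right)^d, \\
% Replace \epsilon by \epsilon / M_2 and use N <= M (assumed content of Lemma 3):
N(\epsilon/M_2, B_{M_1}, \|\cdot\|_2)
  &\le M(\epsilon/M_2, B_{M_1}, \|\cdot\|_2)
   \le \left(\frac{2 M_1 M_2}{\epsilon} + 1\right)^d .
\end{align*}
```

The same substitution applied to the bound for $B_{M_1} \cap H_{\|c\|_2^{-1}}$ replaces each $\epsilon/2$ in the volume ratio by $\epsilon/(2 M_2)$ and leaves the factor $\left(\frac{2 M_1 M_2}{\epsilon} + 1\right)^d$.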

We are now done with the covering number and volumetric arguments. What remains
is to show how a uniform generalization bound can be adapted to handle covering numbers.
We initially concern ourselves with the class of loss functions $l_{\mathcal{F}} := \{l_f : f \in \mathcal{F}\}$, and adapt
the bound to handle classes $\mathcal{F}_2$ and $\mathcal{F}_0$. The form and proof of the convergence in terms
of the size of $l_{\mathcal{F}}$ become simpler when each $l_f \in l_{\mathcal{F}}$ is non-negative and bounded. This
is indeed the case since the logistic loss is non-negative and bounded (the latter because
$\mathcal{F}$ and its subsets are bounded sets of linear functions). We will use the following uniform
convergence bound of Pollard (1984).



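Theorem 8 (Pollard 1984) Let $l_{\mathcal{F}}$ be a set of functions on $\mathcal{X} \times \mathcal{Y}$ with $0 \le l_f(x, y) \le
M_{\mathrm{bound}}$, $\forall l_f \in l_{\mathcal{F}}$ and $\forall (x, y) \in \mathcal{X} \times \mathcal{Y}$. Let $\{x_i, y_i\}_1^m$ be a sequence of $m$ examples drawn
independently according to $\mu_{\mathcal{X} \times \mathcal{Y}}$. Then for any $\epsilon > 0$,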



P{3lf G Ir : Wx.ixi.y.Y^') - R{fx)\ > e) < 4E[iV(e/16, || • ))]exp 



-me 



Proof See Pollard (1984) (also in Zhang, 2002). The constants have been refined in other
works since the first result, and we have left the original constants intact here. ■



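We can relate the covering numbers for Pollard's $l_{\mathcal{F}}$ and the covering numbers for $\mathcal{F}$ as
follows.

Lemma 9 (Relating $l_{\mathcal{F}}$ to $\mathcal{F}$) If every function from the function class $l_{\mathcal{F}}$, represented as
$l : f(\mathcal{X}) \times \mathcal{Y} \to \mathbb{R}$, $f \in \mathcal{F}$, is Lipschitz in its first argument with Lipschitz constant $\mathcal{L}$, then
the covering number of $l_{\mathcal{F}}$ is related to the covering number of $\mathcal{F}$ as

$$\sup_{\mu_{XY}^m} N(\epsilon, l_{\mathcal{F}}, \|\cdot\|_{L_1(\mu_{XY}^m)}) \le \sup_{\mu_X^m} N(\epsilon/\mathcal{L}, \mathcal{F}, \|\cdot\|_{L_1(\mu_X^m)}).$$

Proof Consider two functions $f, g \in \mathcal{F}$. Let the corresponding functions in class $l_{\mathcal{F}}$ be
$l_f(x, y) = l(f(x), y)$ and $l_g(x, y) = l(g(x), y)$. Then

$$
\begin{aligned}
\|l_f - l_g\|_{L_1(\mu_{XY}^m)} &= \frac{1}{m}\sum_{i=1}^m |l_f(x_i, y_i) - l_g(x_i, y_i)| = \frac{1}{m}\sum_{i=1}^m |l(f(x_i), y_i) - l(g(x_i), y_i)| \\
&\le \frac{\mathcal{L}}{m}\sum_{i=1}^m |f(x_i) - g(x_i)| = \mathcal{L}\, \|f - g\|_{L_1(\mu_X^m)}.
\end{aligned}
$$

This implies, given $\{x_i, y_i\}_1^m$, that if $\hat{\mathcal{F}}$ is a minimal $\epsilon/\mathcal{L}$-cover of $\mathcal{F}$ in $L_1(\mu_X^m)$, we can construct
an $\epsilon$-cover of $l_{\mathcal{F}}$ in $L_1(\mu_{XY}^m)$ as

$$\{l_f : f \in \hat{\mathcal{F}}\}. \qquad ■$$

The logistic loss $\log(1 + e^{-y f(x)})$, when viewed as a function of $f(x)$, has a Lipschitz
constant $\mathcal{L} \le 1$. For a similar result using the squared loss see Lemma 17.4 of Anthony and
Bartlett (1999).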
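The Lipschitz constant of the logistic loss can be verified in one line. The worked step below assumes binary labels $y \in \{-1, +1\}$, which this section does not restate, and writes $z$ for the value $f(x)$.

```latex
\begin{align*}
\frac{\partial}{\partial z} \log\left(1 + e^{-y z}\right)
  &= \frac{-y\, e^{-y z}}{1 + e^{-y z}}
   = \frac{-y}{1 + e^{y z}}, \\
\left| \frac{\partial}{\partial z} \log\left(1 + e^{-y z}\right) \right|
  &= \frac{|y|}{1 + e^{y z}} \le |y| = 1
  \qquad \text{for } y \in \{-1, +1\}.
\end{align*}
```

By the mean value theorem, $|l(z_1, y) - l(z_2, y)| \le |z_1 - z_2|$ for all $z_1, z_2$, which is the Lipschitz condition of Lemma 9 with $\mathcal{L} \le 1$.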
Theorem 8 involves an $L_1$ covering number, but our volumetric argument is in terms
of an $L_2$ covering number. The following lemma applies the statement $\|f - g\|_{L_1(\mu_X^m)} \le
\|f - g\|_{L_2(\mu_X^m)}$ (true because of Jensen's inequality applied to norms) to the covering numbers.

Lemma 10 $N(\epsilon, A, \|\cdot\|_{L_1(\mu_X^m)}) \le N(\epsilon, A, \|\cdot\|_{L_2(\mu_X^m)})$.

Proof For a version, see Lemma 10.5 in Anthony and Bartlett (1999). ■



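Finally, we can prove the main result.
Proof (Of the main theorem) Starting from the expectation term on the right hand side of
Theorem 8:

$$
\begin{aligned}
& E[N(\epsilon/16, l_{\mathcal{F}_0}, \|\cdot\|_{L_1(\mu_{XY}^m)})] \\
&\le E[N(\epsilon/16, l_{\mathcal{F}_2}, \|\cdot\|_{L_1(\mu_{XY}^m)})] && \text{from Lemma 5} \\
&\le \sup_{\mu_{XY}^m} N(\epsilon/16, l_{\mathcal{F}_2}, \|\cdot\|_{L_1(\mu_{XY}^m)}) && \text{bounding expectation by supremum} \\
&\le \sup_{\mu_X^m} N\Big(\frac{\epsilon}{16 \mathcal{L}}, \mathcal{F}_2, \|\cdot\|_{L_1(\mu_X^m)}\Big) && \text{from Lemma 9} \\
&\le \sup_{\mu_X^m} N\Big(\frac{\epsilon}{16 \mathcal{L}}, \mathcal{F}_2, \|\cdot\|_{L_2(\mu_X^m)}\Big) && \text{from Lemma 10} \\
&\le N\Big(\frac{\epsilon}{16 \cdot 1 \cdot M_2}, B_{M_1} \cap H_{\|c\|_2^{-1}}, \|\cdot\|_2\Big) && \text{from Lemma 4 and substituting } \mathcal{L} = 1 \\
&\le \frac{\mathrm{Vol}\big(B_{M_1 + \frac{\epsilon}{32 M_2}} \cap H_{\|c\|_2^{-1} + \frac{\epsilon}{32 M_2}}\big)}{\mathrm{Vol}\big(B_{M_1 + \frac{\epsilon}{32 M_2}}\big)} \Big(\frac{32 M_1 M_2}{\epsilon} + 1\Big)^d && \text{from Theorem 7} \\
&= \alpha(d, C_g, c) \Big(\frac{32 M_1 M_2}{\epsilon} + 1\Big)^d && \text{from Lemma 6.}
\end{aligned}
$$

The above step uses the relationship between the spherical cap and its complement along
with Lemma 6:

$$\mathrm{Vol}\big(B_{M_1} \cap H_{\|c\|_2^{-1}}\big) = \mathrm{Vol}(B_{M_1}) - \mathrm{Vol}\big(B_{M_1} \cap H^c_{\|c\|_2^{-1}}\big).$$

Using the bound on $E[N(\epsilon/16, l_{\mathcal{F}_0}, \|\cdot\|_{L_1(\mu_{XY}^m)})]$ obtained above in Theorem 8 gives the
result. ■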



6. Discussion and Related Works 

In this work, we present a machine learning algorithm that takes into account the way its 
recommendations will be ultimately used. This algorithm takes advantage of uncertainty 
in the model in order to potentially find a much more practical solution. Including these 
operating costs is a new way of incorporating "structure" into machine learning algorithms,
and we plan to explore this in other ways in ongoing work. One main focus of the work is 
to discuss a tradeoff between training error and operational cost. In doing so, we showed 
a new way in which data dependent regularization can influence an algorithm's prediction 
ability, formalized through generalization bounds. 

There is a vast literature on regularization, but in the past it has been used to impose
prior beliefs (e.g., "structure" such as sparsity, as in Tibshirani, 1996, or shrinking certain
coefficients towards each other), robustness (e.g., to obtain a large "margin," which is the
distance from the decision boundary to the nearest training example, as in Vapnik, 1998), or
additional distributional information (semi-supervised learning; see for instance Chapelle et al., 2006).
Out of these, only semi-supervised learning uses unlabeled data, but our problem differs in
that our unlabeled data does not need to be drawn from the same distribution as the training
data, and thus does not necessarily provide any distributional information. It provides
instead information about the practical cost of following the algorithm's recommendations.

In addition to the above, we developed cost models that apply to routing problems. For
the power grid application and other maintenance applications, the distances $\{d_{i,j}\}_{i,j}$
correspond to physical distances. It is possible to use the techniques developed here for more
abstract routing problems, for instance, network scheduling or network routing problems, where
distance on the graph does not necessarily correspond to a physical distance. There are
other works that schedule events based on a linearly increasing cost model (see for instance
Anily et al., 1998).



There is a body of literature regarding cost models for maintenance in the reliability
modeling literature, though the emphasis in those works is usually to design a model that
accurately represents the stochastic process for the failures. In particular, there are works on
condition-based maintenance, where a maintenance schedule is created from the predicted
condition of the equipment (but not on the cost of performing the repairs in a certain order
or routing a vehicle between the equipment). Barbera et al. (1996) develop a model that
assumes that equipment have exponential rates of failure and fail only once in an inspection
interval, and they use this model to determine a maintenance schedule. Marseguerra
et al. (2002) introduce a model for degradation leading to failure for a continuous complex
system, and use Monte Carlo simulations to determine the optimal degradation level at which to
perform an inspection. Their work uses a very different cost model from ours; the cost is
the long run average maintenance cost and cost of failures. A neural-network based maintenance
model was developed by Heng et al. (2009). Another large body of work considers
more sophisticated estimates for system faults: for example, by modeling (repeat) measurements
as time series (Xu et al., 2009). Depending on the application, one could replace the
training error in our model with a more elaborate failure model such as the ones developed
in these works.

If we were able to find an efficient method for approximately solving the TRP subproblem,
it could allow us to compute solutions to the ML&TRP significantly faster. Constant
factor approximation algorithms for the standard (unweighted) TRP have been developed
in several works (Blum et al., 1994; Goemans and Kleinberg, 1998; Arora and Karakostas,
2006; Archer et al., 2008; Archer and Blasiak, 2010). These schemes typically have (quasi-)
polynomial time guarantees and approximate up to a constant ratio of the optimal standard
TRP objective value. The constant factors are 3.59 or larger. Heuristic methods
might also be used for solving the standard TRP and related problems (Dewilde et al.,
2010; Salehipour et al., 2010), and these can potentially be adapted to solve the weighted TRP.
There are some difficulties in doing this because the heuristics depend on the exact way
the cost is defined. For example, Dewilde et al. (2010) solve a variation of the TRP which
cannot easily be adapted for solving the weighted TRP. Lechmann (2009) surveys the
various applications and solution techniques for the different versions of the TRP.

There could be many variations on the setup for the ML&TRP. In some applications, 
real time sensor measurements are available, and it is possible to automatically turn off 
the equipment when it fails in order to prevent more failures from occurring. This is not 
possible for the power grid application, since it is not possible (and not desirable) to turn 
off the electricity supply in the secondary electrical distribution network, but it may be 
possible in other applications. 

A related work on routing for emergency maintenance on the electrical grid is the heuristic
algorithm of Weintraub et al. (1999) that dispatches vehicles to areas where there are
currently breakdowns and where there are likely to be breakdowns in the future.